Why use provisioned IOPS volumes for AWS databases?

MySQL Performance Blog - Wed, 2016-06-01 22:05

In this blog, we’ll use some test results to look at the rationale for using provisioned IOPS volumes for AWS databases.

One piece of advice you often hear running MySQL, MongoDB or other databases in the AWS EC2 environment is that you should use volumes with provisioned IOPs. This kind of makes sense on the “marketing” level, where provisioned IOPS (io1) volumes are designed for IO-intensive database workloads, while General Purpose (gp2) volumes are not. But if you go to the AWS volume type description, you will find that gp2s are shown to have pretty good IO performance. So where do all these supposed database performance problems for Amazon Elastic Block Store (EBS), with no provisioned IOs, come from?

Here is what I found out running experiments with a beta of Percona Monitoring and Management.

I ran a typical database instance workload, where the OLTP workload uses around 20% of the system capacity, and periodically I have a single user IO intensive batch job hitting the same system. Even if you do not have batch jobs running, your backup is likely to show this same IO pattern.

What would happen in this case if you have conventional local storage? Some queueing happens on the storage level, but as there is only one user with intensive IO, the impact is typically not very significant. What do we see from the AWS gp2 volume?

At first, the read services spike to more than 1.5K IOPS, and while latency increases from normal 1-2ms, it remains below 10ms on average. However, after a couple of minutes IOPS drops to around 500, and read latency spikes to over 100ms (note the log scale on the graph).

What is happening here? The gp2 volumes behave differently than your conventional storage by allowing IO bursts for short periods of time – after a short period of time, however, the IOs are throttled (in this case to only 500/sec). How does the throttling work? By adding delay to IO completion so that only the required IOs are completed per second, and the more concurrency we add to such throttled devices, the higher the average IO response latency is!

What does this mean from an application point of view? Let’s say you have a database transaction that requires 100 reads from the disk. If you have an average of 1ms latency, this transaction takes about 100ms reading from the disk, and will likely be seen as very good user experience. If you have an average IO latency of 100ms, the same transaction spends ten seconds reading from the disk – well above the tolerance for many users.   

As a DBA, you can see how putting an extra (small) load on the database system (such as running batch job or backup) can cause your boss to come screaming that the website is down ten minutes later.

There is another key difference between conventional local storage such as RAID or SSD, and an EBS volume. Not all local storage IO is created equal, while an EBS general purpose volume seems to inject latencies into IO operations independent of what the IO is.

Transactional log flushes are one of the most latency critical IO operations databases perform. These are very small (often just 1 page) sequential writes. RAID controllers and SSDs can handle these very quickly by only writing in memory (battery or capacitor backed up), at a fraction of the costs of other operations. This is not the case for EBS gp2: log writes come with high latency.

We can see this latency in Performance Schema graphs, where such patch jobs correlate to a huge amount of time spent writing to the InnoDB Transactional Log file or Binary Log File:

We can also see the main InnoDB thread spending up to 30% of its time flushing the log – the number is drastically lower for typical storage configuration:

Another way AWS EBS storage is different from the typical local storage is that size directly buys you performance. GP2 volumes provide 3 IOPS/GB, up to 10000 IOPS (99 percentile figure),  which means that larger storage will have higher performance – though if anything, this means you’re getting better performance from your larger production volumes than your smaller test ones.

A final note: EBS storage is essentially connected to a network, which means both slightly higher latencies and limited throughput. According to the documentation, there is 160MiB/s throughput limit per volume, which is a lot less than even inexpensive SATA SSD. SSD often can provide 500MB/sec or more, and are generally limited by SATA bus capacity.  

My takeaways from these results:

  • EBS General Purpose volumes have decent performance for light-duty workloads – if you don’t demand a lot of IOPS from your storage for prolonged periods of time. If you do, storage with provisioned IOPS is a better choice
  • Whenever you’re using Amazon or other environments with multi-tenant virtualized storage, I would highly suggest running some benchmark on how it behaves for the above scenarios. The assumptions you have about your conventional RAID or SSD storage might not apply.

Want to play around with live graphs? Check out our PMM Demo, which is currently running the stated workload on Amazon EC2. You can also install the beta version to use with your own system.


Categories: MySQL

Troubleshooting locking issues webinar: Q & A

MySQL Performance Blog - Tue, 2016-05-31 23:58

In this blog, I will provide answers to the Q & A for the Troubleshooting locking issues webinar.

First, I want to thank you for attending the May, 12 webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: Do you have the links to those other info sources?

A: Yes, they are listed in the “More Information” slide. In the PDF, all the links are active. If you speak Russian, you can also check this presentation by Dmitry Lenev. He also did a similar presentation in English for MySQL Connect, but now all the content is gone from the official website, so only chance to find his slides in English is to search web archives.

Q: Are you going to discuss metadata locks?

A: Yes. I discuss them in slides 11-16.

Q: Why do row locking when table level lock is already set by InnoDB. My question was table level lock is already set. You update 100 rows in that table, but InnoDB locks these 100 rows. Why? The table is already locked . . .

A: Table lock, which you saw on slide #20, is set by InnoDB only for a short time and almost immediately released. But the transaction not closed yet, and InnoDB still needs to protect updated rows from modifications by other transactions. Why can’t it be done with table-level lock only? Imagine you have a table with 1,000,000 rows. All have an ID from 1 to 1,000,000 (and other fields). Now imagine you need to update the row with ID=1. In the case of table lock, the whole table is locked while you are performing this one update. If another connection wants to update a row with ID=202, it has to wait. In the case of row-level locks, the two queries do not interfere each other and can apply in parallel.

Q: How do you avoid locks on alter, without resetting that transaction?

A: If you are using version 5.6 and up, many ALTER commands are non-locking. See the overview of online DDL in the user manual. However, if you want to use an ALTER variation that cannot be done online, you can use the utility pt-online-schema-change from Percona Toolkit. Note that ALTER will take longer than the regular “blocking” variant, but it will not block your other connections.

Q: This is not a question, but there is a typo on the slide – it should be Intention Locks, not Intension Locks

A: Thank you! I fixed this and the wrong table name in slide #28. Please download updated version of slides.

Q: Why does the ALTER table operation have to wait forever? It should start once the transaction finished, but I know that the lock will remain. Why doesn’t it unlock when the transaction is finished?

A: Of course it doesn’t wait forever! It was just an acronym for “waits very long time,” which can happen if you have a very busy application, with many threads updating the same table. Or if you don’t close transactions.

Q: Is the metadata_locks table enabled by default?

A: Yes.

Categories: MySQL

What is a big innodb_log_file_size?

MySQL Performance Blog - Tue, 2016-05-31 15:45

In this post, we’ll discuss what constitutes a big innodb_log_file_size, and how it can affect performance.

In the comments for our post on Percona Server 5.7 performance improvements, someone asked why we use innodb_log_file_size=10G with an indication that it might be too big?

In my previous post (, the example used innodb_log_file_size=15G. Is that too big? Let’s take a more detailed look at this.

First, let me start by rephrasing my warning: the log file size should be set as big as possible, but not bigger than necessary. A bigger log file size is better for performance, but it has a drawback (a significant one) that you need to worry about: the recovery time after a crash. You need to balance recovery time in the rare event of a crash recovery versus maximizing throughput during peak operations. This limitation can translate to a 20x longer crash recovery process!

But how big is “big enough”? Is it 48MB (the default value), 1-2GB (which I often see in production), or 10-15GB (like we use for benchmarks)?

I wrote about how the innodb_log_file_size is related to background flushing five years ago, and I recommend this post if you are interested in details:

InnoDB Flushing: Theory and solutions

Since that time many improvements have been made both in Percona Server and MySQL, but a small innodb_log_file_size still affects the throughput.

How? Let’s review how writes happen in InnoDB. Practically all data page writes happen in the background. It seems like background writes shouldn’t affect user query performance, but it does. The more intense background writes are, the more resources are taken away from the user foreground workload. There are three big forces that rule background writes:

  1. How close checkpoint age is to the async point (again, see previous material This is adaptive flushing.
  2. How close is innodb_max_dirty_pages_pct to the percentage of actual dirty pages.  You can see this in the LRU flushing metrics.
  3. What amount of free pages are defined by innodb_lru_scan_depth. This is also in LRU flushing metrics.

So in this equation innodb_log_file_size defines the async point, and how big checkpoint age can be.

To show a practical application of these forces, I’ve provided some chart data. I will use charts from the Percona Monitoring and Management tool and data from Percona Server 5.7.

Before jumping to graphs, let me remind you that the max checkpoint age is defined not only by innodb_log_file_size, but also innodb_log_files_in_group (which is usually “2” by default). So innodb_log_file_size=2GB will have 4GB of log space, from which MySQL will use about 3.24GB (MySQL makes extra reservations to avoid a situation when we fully run out of log space).

Below are graphs from a tpcc-mysql benchmark with 1500 warehouses, which provides about 150GB of data. I used innodb_buffer_pool_size=64GB, and I made two runs:

  1. with innodb_log_file_size=2GB
  2. with innodb_log_file_size=15GB

Other details about my setup:

  • CPU: 56 logical CPU threads servers Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
  • OS: Ubuntu 16.04
  • Kernel 4.4.0-21-generic
  • The storage device is Samsung SM863 SATA SSD, single device, with ext4 filesystem
  • MySQL versions: Percona Server 5.7.11
  • innodb_io_capacity=5000 / innodb_io_capacity_max=7500

On the first chart, let’s look at the max checkpoint age, current checkpoint age and amount of flushed pages per second:

. . . and also a related graph of how many pages are flushed by different forces (LRU flushing and adaptive flushing). You can receive this data by enabling innodb_monitor_enable = '%'.

From these charts, we can see that with 2GB innodb_log_file_size InnoDB is forced by adaptive flushing to flush (write) more pages, because the current checkpoint age (uncheckpointed bytes) is very close to Max Checkpoint Age. To see the checkpoint age in MySQL, you can use the innodb_metrics table and metrics recovery_log_lsn_checkpoint_age and recovery_log_max_modified_age_sync.

In the case using innodb_log_file_size=15GB, the main flushing is done via LRU flushing (to keep 5000 pages (innodb_lru_scan_depth) free per buffer pool instance). From the first graph we can figure that uncheckpointed bytes never reach 12GB, so in this case using innodb_log_file_size=15GB is overkill. We might be fine with innodb_log_file_size=8GB – but we wouldn’t know unless we set the innodb_log_file_size big enough. MySQL 5.7 comes with a very convenient improvement: now it is much easier to change the innodb_log_file_size, but it still requires a server restart. I wish we could change it online, like we can for innodb_buffer_pool_size (I do not see technical barriers for this).

Let’s also look into the InnoDB buffer pool content:

We can see that there are more modified pages in the case with 15GB log files (which is good, as more modified pages means less work done in the background).

And the most interesting question: how does it affect throughput?

With innodb_log_file_size=2GB, the throughput is about 20% worse. With a 2GB log size, you can see that often zero transactions are processed within one second – this is bad, and says that the flushing algorithm still needs improvements in cases when the checkpoint age is close to or at the async point.

This should make a convincing case that using big innodb_log_file_size is beneficial. In this particular case, probably 8GB (with innodb_log_files_in_group=2) would be enough.

What about the recovery time? To measure this, I killed mysqld when the checkpoint age (uncheckpointed bytes) was about 10GB. It appeared to take 20 mins to start mysqld after the crash. In another experiment with 25GB of uncheckpointed bytes, it took 45 mins. Unfortunately, crash recovery in MySQL is still singlethreaded, so it takes a long time to read and apply 10GB worth of changes (even from the fast storage).

We can see that recovery is single-threaded from the CPU usage chart during recovery:

The system uses 2% of the CPU (which corresponds to a single CPU).

In many cases, crash recovery is not a huge concern. People don’t always have to wait for MySQL recovery – since even one minute of downtime can be too long, often the instance fails over to a slave (especially with async replication), or the crashed node just leaves the cluster (if you use Percona XtraDB Cluster).

I would still like to see improvements in this area. Crash recovery is the biggest showstopper for using a big innodb_log_file_size, and I think it is possible to add parallelism similar to multithreaded slaves into the crash recovery process.

You can find the raw results, scripts and configs here.


Categories: MySQL

Galera Error Failed to Report Last Committed (Interrupted System Call)

MySQL Performance Blog - Fri, 2016-05-27 22:00

In this blog, we’ll discuss the ramifications of the Galera Error Failed to Report Last Committed (Interrupted System Call).

I have recently seen this error with Percona XtraDB Cluster (or Galera):

[Warning] WSREP: Failed to report last committed 549684236, -4 (Interrupted system call)

It was posted in launchpad as a bug in 2013:

My colleague Przemek replied, and explained it as:

Reporting the last committed transaction is just a part of the certification index purge process. In case it fails for some reason (it occasionally does), the cert index purge may be a little delayed. But it does not mean the transaction was not applied successfully. This is a warning after all.

If we look up this error in the source code, we realize it is reusing Linux system errors. Specifically:

#define EINTR 4 /* Interrupted system call */

As there isn’t much documentation regarding this error, and internet searches did not bring up useful information, my colleague David Bennett and I delved into the source code (as we do on occasion).

If we look in the Galera source code gcs_sm.hpp we see:

289  * @retval -EINTR  - was interrupted by another thread

We also see:

317                 /* was interrupted, will be handled by someone else */

This means that the thread was interrupted, but the server will retry on another thread. As it is just a warning, it isn’t anything to be too concerned about – unless they begin to pile up (which could be a sign of concurrency issues).

The specific warning is thrown from galera_service_thd.cpp here:

58                 if (gu_unlikely(ret < 0))
59                 {
60                     log_warn << "Failed to report last committed "
61                              << data.last_committed_ << ", " << ret
62                              << " (" << strerror (-ret) << ')';
63                     // @todo: figure out what to do in this case
64                 }

This warning could be handled better so as to not bloody the logs, or sound cryptic enough to concern administrators.

Categories: MySQL

Asynchronous Query Execution with MySQL 5.7 X Plugin

MySQL Performance Blog - Fri, 2016-05-27 20:58

In this blog, we will discuss MySQL 5.7 asynchronous query execution using the X Plugin.


MySQL 5.7 supports X Plugin / X Protocol, which allows (if the library supports it) asynchronous query execution. In 2014, I published a blog on how to increase a slow query performance with the parallel query execution. There, I created a prototype in the bash shell. Here, I’ve tried a similar idea with NodeJS + mysqlx library (which uses MySQL X Plugin).

TL;DR version: By using the MySQL X Plugin with NodeJS I was able to increase query performance 10x (some query rewrite required).

X Protocol and NodeJS

Here are the steps required:

  1. First, we will need to enable X Plugin in MySQL 5.7.12+, which will use a different port (33060 by default).
  2. Second, download and install NodeJS (>4.2) and mysql-connector-nodejs-1.0.2.tar.gz (follow Getting Started with Connector/Node.JS guide).
    # node --version v4.4.4 # wget # npm install mysql-connector-nodejs-1.0.2.tar.gz
    Please note: on older systems, you will probably need to upgrade the nodejs version. Follow the Installing Node.js via package manager guide.
  3. All set! Now we can use the asynchronous queries feature.

Test data 

I’m using the same Wikipedia Page Counts dataset (wikistats) I’ve used for my Apache Spark and MySQL example. Let’s imagine we want to compare the popularity of MySQL versus PostgeSQL in January 2008 (comparing the total page views). Here are the sample queries:

mysql> select sum(tot_visits) from wikistats_by_day_spark where url like '%mysql%'; mysql> select sum(tot_visits) from wikistats_by_day_spark where url like '%postgresql%';

The table size only holds data for English Wikipedia for January 2008, but still has ~200M rows and ~16G in size. Both queries run for ~5 minutes each, and utilize only one CPU core (one connection = one CPU core). The box has 24 CPU cores, Intel(R) Xeon(R) CPU L5639 @ 2.13GHz. Can we run the query in parallel, utilizing all cores?

That is possible now with NodeJS and X Plugin, but require some preparation:

  1. Partition the table using hash, 24 partitions:
    CREATE TABLE `wikistats_by_day_spark_part` ( `id` int(11) NOT NULL AUTO_INCREMENT, `mydate` date NOT NULL, `url` text, `cnt` bigint(20) NOT NULL, `tot_visits` bigint(20) DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=239863472 DEFAULT CHARSET=latin1 /*!50100 PARTITION BY HASH (id) PARTITIONS 24 */
  2. Rewrite the query running one connection (= one thread) per each partition, choosing its own partition for each thread:
    select sum(tot_visits) from wikistats_by_day_spark_part partition (p<N>) where url like '%mysql%';
  3. Wrap it up inside the NodeJS Callback functions / Promises.

The code

var mysqlx = require('mysqlx'); var cs_pre = { host: 'localhost', port: 33060, dbUser: 'root', dbPassword: 'mysql' }; var cs = { host: 'localhost', port: 33060, dbUser: 'root', dbPassword: 'mysql' }; var partitions = []; var res = []; var total = 0; mysqlx.getNodeSession( cs_pre ).then(session_pre => { var sql="select partition_name from information_schema.partitions where table_name = 'wikistats_by_day_spark_part' and table_schema = 'wikistats' "; session_pre.executeSql(sql) .execute(function (row) { partitions.push(row); }).catch(err => { console.log(err); }) .then( function () { partitions.forEach(function(p) { mysqlx.getNodeSession( cs ).then(session => { var sql="select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(" + p + ") where url like '%mysql%';" console.log("Started SQL for partiton: " + p); return Promise.all([ session.executeSql(sql) .execute(function (row) { console.log(p + ":" + row); res.push(row); total = Number(total) + Number(row); }).catch(err => { console.log(err); }), session.close() ]); }).catch(err => { console.log(err + "partition: " + p); }).then(function() { // All done if (res.length == partitions.length) { console.log("All done! Total: " + total); // can now sort "res" array if needed an display } }); }); }); session_pre.close(); }); console.log("Starting...");

The explanation

The idea here is rather simple:

  1. Find all the partitions for the table by using “select partition_name from information_schema.partitions”
  2. For each partition, run the query in parallel: create a connection, run the query with a specific partition name, define the callback function, then close the connection.
  3. As the callback function is used, the code will not be blocked, but rather proceed to the next iteration. When the query is finished, the callback function will be executed.
  4. Inside the callback function, I’m saving the result into an array and also calculating the total (actually I only need a total in this example).
    .execute(function (row) { console.log(p + ":" + row); res.push(row); total = Number(total) + Number(row); ...

Asynchronous Salad: tomacucumtoes,bersmayonn,aise *

This may blow your mind: because everything is running asynchronously, the callback functions will return when ready. Here is the result of the above script:

$ time node async_wikistats.js Starting... Started SQL for partiton: p0 Started SQL for partiton: p1 Started SQL for partiton: p2 Started SQL for partiton: p3 Started SQL for partiton: p4 Started SQL for partiton: p5 Started SQL for partiton: p7 Started SQL for partiton: p8 Started SQL for partiton: p6 Started SQL for partiton: p9 Started SQL for partiton: p10 Started SQL for partiton: p12 Started SQL for partiton: p13 Started SQL for partiton: p11 Started SQL for partiton: p14 Started SQL for partiton: p15 Started SQL for partiton: p16 Started SQL for partiton: p17 Started SQL for partiton: p18 Started SQL for partiton: p19 Started SQL for partiton: p20 Started SQL for partiton: p21 Started SQL for partiton: p22 Started SQL for partiton: p23

… here the script will wait for the async calls to return, and they will return when ready – the order is not defined.

Meanwhile, we can watch MySQL processlist:

+------+------+-----------------+-----------+---------+-------+--------------+-------------------------------------------------------------------------------------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +------+------+-----------------+-----------+---------+-------+--------------+-------------------------------------------------------------------------------------------------------------------+ | 186 | root | localhost:44750 | NULL | Sleep | 21391 | cleaning up | PLUGIN | | 2290 | root | localhost | wikistats | Sleep | 1417 | | NULL | | 2510 | root | localhost:41737 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p0) where url like '%mysql%' | | 2511 | root | localhost:41738 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p1) where url like '%mysql%' | | 2512 | root | localhost:41739 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p2) where url like '%mysql%' | | 2513 | root | localhost:41741 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p4) where url like '%mysql%' | | 2514 | root | localhost:41740 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p3) where url like '%mysql%' | | 2515 | root | localhost:41742 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p5) where url like '%mysql%' | | 2516 | root | localhost:41743 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p6) where url like '%mysql%' | | 2517 | root | localhost:41744 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p7) where url like '%mysql%' | | 2518 | root | localhost:41745 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p8) where url like '%mysql%' | | 2519 | root | localhost:41746 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p9) where url like '%mysql%' | | 2520 | root | localhost:41747 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p10) where url like '%mysql%' | | 2521 | root | localhost:41748 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p11) where url like '%mysql%' | | 2522 | root | localhost:41749 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p12) where url like '%mysql%' | | 2523 | root | localhost:41750 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p13) where url like '%mysql%' | | 2524 | root | localhost:41751 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p14) where url like '%mysql%' | | 2525 | root | localhost:41752 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p15) where url like '%mysql%' | | 2526 | root | localhost:41753 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p16) where url like '%mysql%' | | 2527 | root | localhost:41754 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p17) where url like '%mysql%' | | 2528 | root | localhost:41755 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p18) where url like '%mysql%' | | 2529 | root | localhost:41756 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p19) where url like '%mysql%' | | 2530 | root | localhost:41757 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p20) where url like '%mysql%' | | 2531 | root | localhost:41758 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p21) where url like '%mysql%' | | 2532 | root | localhost:41759 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p22) where url like '%mysql%' | | 2533 | root | localhost:41760 | NULL | Query | 2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p23) where url like '%mysql%' | | 2534 | root | localhost | NULL | Query | 0 | starting | show full processlist | +------+------+-----------------+-----------+---------+-------+--------------+-------------------------------------------------------------------------------------------------------------------+

And CPU utilization:

Tasks: 41 total, 1 running, 33 sleeping, 7 stopped, 0 zombie %Cpu0 : 91.9 us, 1.7 sy, 0.0 ni, 6.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 : 97.3 us, 2.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 97.0 us, 3.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 97.7 us, 2.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu4 : 95.7 us, 2.7 sy, 0.0 ni, 1.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 98.3 us, 1.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 : 98.3 us, 1.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 : 97.7 us, 2.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu8 : 96.7 us, 3.0 sy, 0.0 ni, 0.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu9 : 98.3 us, 1.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu10 : 95.7 us, 4.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu11 : 97.7 us, 2.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu12 : 98.0 us, 2.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu13 : 98.0 us, 1.7 sy, 0.0 ni, 0.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu14 : 97.7 us, 2.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu15 : 97.3 us, 2.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu16 : 98.0 us, 2.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu17 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu18 : 97.3 us, 2.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu19 : 98.7 us, 1.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu20 : 99.3 us, 0.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu21 : 97.3 us, 2.3 sy, 0.0 ni, 0.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu22 : 97.0 us, 3.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu23 : 96.0 us, 4.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st ... PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 18901 mysql 20 0 25.843g 0.017t 7808 S 2386 37.0 295:34.05 mysqld

Now, here is our “salad”:

p1:2499 p23:2366 p2:2297 p0:4735 p12:12349 p14:1412 p3:2045 p16:4157 p20:3160 p18:8717 p17:2967 p13:4519 p15:5462 p10:1312 p5:2815 p7:4644 p9:766 p4:3218 p6:4175 p21:2958 p8:929 p19:4182 p22:3231 p11:4020

As we can see, all partitions are in random order. If needed, we can even sort the result array (which isn’t needed for this example as we only care about the total). Finally our result and timing:

All done! Total: 88935 real 0m30.668s user 0m0.256s sys 0m0.028s

Timing and Results

  • Original query, single thread: 5 minutes
  • Modified query, 24 threads in Node JS: 30 seconds
  • Performance increase: 10x

If you are interested in the original question (MySQL versus PostgreSQL, Jan 2008):

  • MySQL, total visits: 88935
  • PostgreSQL total visits: 17753

Further Reading:

PS: Original Asynchronous Salad Joke, by Vlad @Crazy_Owl (in Russian)

Categories: MySQL

Join me for the Community Open House for MongoDB

MySQL Performance Blog - Fri, 2016-05-27 18:36

If you can make it to Manhattan Thursday, June 30, 2016, please join me at the Community Open House for MongoDB. The Community Open House for MongoDB, held from 9 AM to 6 PM at the Park Central Hotel, is a free and open event that will feature technical presentations and sessions from key members of the MongoDB open source community. It’s the day after MongoDB World ends, so many of you will already be in town and ready to talk MongoDB – stay one more day to get an even bigger and broader perspective on your MongoDB needs!

The open source community is a diverse and powerful collection of companies, organizations and individuals that have helped to literally change the world. Percona is proud to call itself a member of the open source community, and we strongly feel that upholding the principles of the community is a key to our success. These principals include an open dialog, an open mind, and a zeal for cooperative interaction. Since Percona and ObjectRocket weren’t allowed to participate in MongoDB World, we felt this event would help foster creative and productive dialogue within the community. Together, we can build amazing things!

A reception will be held from 4:30 PM to 6:00 PM, featuring plenty of food, drink and fun.

Event Speakers:

The Community Open House for MongoDB features expert speakers from Percona, ObjectRocket, Facebook, Appboy, Severalnines and The Washington Post. Here’s a partial lineup – more will be announced soon!

  • Open Source Monitoring for MongoDB – Peter Zaitsev, Percona
  • Real-world Operational Concerns for Scaling MongoDB – Kim Wilkins, ObjectRocket
  • Doing More With Less With RocksDB and MongoRocks at Facebook – Islam AbdelRahman, Facebook
  • Tracking Billions of App Installs with MongoDB – Jon Hyman, Appboy
  • WiredTiger Scalability and Performance Optimization – Vadim Tkachenko, Percona
  • Using MongoDB and NodeJS to Build Better Forms at the Washington Post – Kat Styons, The Washington Post
  • Real-world Operational Concerns for Scaling MongoDB – Tim Banks, ObjectRocket
  • How to Automate, Monitor and Manage Existing MongoDB Servers – Art van Scheppingen, Severalnines
  • Open Source Encryption for MongoDB – Tim Banks, ObjectRocket
  • My First Moments with MongoDB – Colin Charles
  • Open Source Backups for MongoDB – David Murphy – Percona
  • MongoDB Chunks – Distribution, Splitting & Merging – Jason Terpko, ObjectRocket
  • MongoDB for MySQL Developers and DBAs – Alexander Rubin, Percona

This event is free of charge and open to all, but we do ask you to register in advance so we can save you a seat.

I hope to see you Thursday in NYC!

Categories: MySQL

Monitoring made easy with Percona App for Grafana

MySQL Performance Blog - Thu, 2016-05-26 17:57

Percona has released a new Percona App for Grafana!

Are you using Grafana 3.x with Prometheus’ time-series database? Now there is a “Percona App” available on! The app provides a set of dashboards for MySQL performance and system monitoring with Prometheus’ datasource, and make it easy for users install them. The dashboards rely on the alias label in the Prometheus config and depend on the small patch applied on Grafana.

The dashboards in the app are:

  • Cross Server Graphs
  • Disk Performance
  • Disk Space
  • Galera Graphs
  • MySQL InnoDB Metrics
  • MySQL MyISAM Metrics
  • MySQL Overview
  • MySQL Performance Schema
  • MySQL Query Response Time
  • MySQL Replication
  • MySQL Table Statistics
  • MySQL User Statistics
  • Prometheus
  • Summary Dashboard
  • System Overview
  • TokuDB Graphs
  • Trends Dashboard

The Grafana and Prometheus teams are doing a fantastic job of bringing monitoring and time-series to the next level. They are making collecting and graphing metrics simple and more usable.

See my previous blog post for step-by-step instructions on how to install Grafana and Prometheus. Get the Percona App for Grafana today!

Categories: MySQL

AWS Aurora Benchmarking part 2

MySQL Performance Blog - Thu, 2016-05-26 12:45

Some time ago, I published the article on AWS Aurora Benchmarking (AWS Aurora Benchmarking – Blast or Splash?), in which I analyzed the behavior of different solutions using synchronous replication in AWS environment. This blog follows up with some of the comments and suggestions I received regarding that post from the community and Amazon engineers.

I decided to perform another round of tests, keeping in mind comments and suggestions received.

I presented some of the results during the Percona conference in Santa Clara last April 2016. The following is the transposition that presentation, with more details.

Not interested in the preliminary descriptions? Go to the results section

Why new tests?

A very good question, with an easy answer.

Aurora is a product that is still under development and refinement: six months of development could present major changes in performance. Not only that, but the initial tests focused on entry-level solutions, meaning I was analyzing the kind of users that are currently starting their business and looking for a flexible solution that allows them to save money and scale.

This time, I put the focus on enterprise solutions by analyzing what an already well-established company would get when looking for a decent scalable solution.

These are two different scenarios.

Why so many (different) tests?

I used many different benchmarking tools, and I am still planning to run others. Why so? Why not simply use one of them?

Again, a simple answer. I used different tools because in some cases, they provide me a different way of accessing and using data. I also do not trust benchmarking tools, not even the ones I developed. I wanted to test the same thing using different tools and compare the results. ONLY if I see a common pattern, then would I consider the test valid. Personally, I tend to discard any test that is not consistent, or if the analysis performed is using a single benchmarking tool. In my opinion, being lazy is not an option when doing these kind of exercises.

About the tests

It was difficult to compare apples to apples here. And I think that is the main point to keep in mind.

Aurora is not a standard RDS solution, like we are used to. Aurora looks like MySQL, smells like MySQL, but is not vanilla MySQL. To achieve what they have, the engineers had to change many parts. The more you dig in, the more you realize there are significant differences.

Because of that, I had to focus more on identifying what each solution can do and compare the solutions against expectations, rather than comparing the numbers.

I was more interested to see what happen if:

  • I have a burst of connections, and my application goes from 4K to 40K connections. Will it crash? Will it slow down?
  • How long should I wait if a node fails?
  • What should I not have in my schema design, to prevent bottlenecks?

Those are relevant questions in my opinion, more so than discovering that solution A has 3000 rows written/sec, and solution B has 3100. Or that I might (might) have some additional page rotation, file -> memory-> flushes because the amount of memory differs.

That is valuable information, for sure, but less valuable than having a decent understanding of which platform will help my business grow and remain stable.

What is the right tool for the job? This is the question I am addressing.

Tests run

I had run three main kinds of tests:

  • Performance and load stress
  • High availability failover
  • Response time (latency) from the application point of view
Performance and load stress

These tests were the most extensive and demanding.

I analyzed the capacity to serve the load under different conditions, from a light load up to full utilization, and some degree of resource saturation.

  • The first set of tests were to evaluate a simple load on a single table, causing the table to become a hotspot and showing how the platform would manage the increasing contention.
  • The second set of tests were to perform a similar load, but distributing it cross multiple tables and batching the operations. Parallelization, contention, scalability and distributed hotspots were in the picture.

The two above focused on write operations only, and were done using different tools (comparing the results as they were complementary).

  • Third set of tests, using my own stress tool, were focused on R/W oriented usage. The tests were executed against multiple tables, performing CRUD actions, using simple and batch insert, reads by PK, index, by range, IN and exact match conditions.
  • The fourth set of tests were performed using a TPC-C like load (OLTP).
  • The fifth set of tests were using sysbench in OLTP mode, with 250 tables.

The scope of the last three set of tests was to identify how the platforms would manage the load, considering the following:

  • Read and write contention on the same tables
  • High level of parallelism (from the application)
  • Possible hot-spots (TPCC district)
  • Increasing utilization (memory, threads, IO)
  • Saturation (connections)

Finally, all tests were run with fully utilized BufferPool.

The machines

Small boxes (first round of tests):

EIP = 1 VPC = 1 ELB=1 Subnets = 4 (1 public, 3 private) HAProxy = 6 MHA Monitor (micro ec2) = 1 NAT Instance (EC2) =1 (hosting EIP) DB Instances (EC2) = 3 (m4.xlarge) 16GB Application Instances (EC2) = 6 (4) EBS SSD 3000 PIOS Aurora RDS node = 3 (db.r3.xlarge) 30GB

Large boxes (latest tests):

EIP = 1 VPC = 1 ELB=1 Subnets = 4 (1 public, 3 private) HAProxy = 4 MHA Monitor (micro ec2) = 1 NAT Instance (EC2) =1 (hosting EIP) DB Instances (EC2) = 3 (c3.8xlarge) 60GB Application Instances (EC2) = 4 EBS SSD 5000 PIOS Aurora RDS node = 3 (db.r3.8xlarge) 244GB

A note

It was pointed out to me that I deliberately chose to use an Ec2 solution for Percona XtraDB Cluster with less memory than the one available in Aurora. This is true, and we must take that into consideration. The reason for this is that the only Ec2 solution matching the memory of a db_r3.8xlarge is the d2.8xlarge.

I did try it, but the level of scalability I got (from the CPU point of view) was less efficient than the one available with c3.8xlarge. I decided to prefer CPU resources to memory, especially because I was going to test concurrency and parallelism in conjunction with the load increase.

From the result, I feel confident that I chose correctly – but I am open to comment.

The layout

This is what the setup looks like:

Where you read Java, those are the application nodes running the different test applications.

Two words about Aurora first

Aurora has a few key concepts that we must have clearly in mind, especially how it manages the writes across replica, and how connections are implemented.

The IO activity

To replicate the information across the different storage, Aurora only replicates FRM files and data coming from IB_LOGS. This is a quite significant advantage to other forms of replication, given the limited number of bytes that are replicated over the network (and also if they are replicated six times).

Another significant advantage is that Aurora does not use a double write buffer, which is obviously another blast (see the recent optimization in Percona Server ).

In other words, writes in Aurora are organized by filling its commit queue and pushing the changes as group commit to the storage.

In some presentations, you might have seen that all steps are asynchronous. But is important to underline that a commit is acknowledged by Aurora when at least two availability zones (AZ) have received and written the incoming data related to that commit. Writes here mean received in the storage node incoming queue and with a quorum of four over six nodes.

This means that no matter what, data has to travel on the network to reach the final destination, and ACK signals come back before Aurora returns the ACK to the commit operation. The network is in the same region, but still it could represent an incognita about performance. No wonder we could have some latency at this stage!

As you can see, what I am reporting is also confirmed in the image below (and in the observations). The point is that the impact of steps 1 – 2 is not obviously clear.

Thread pooling

Aurora also use thread pooling – a lot! That will become very clear later, and as more of the work is based on parallelism, the more efficient thread pooling seems to be.

In most cases we are used to seeing CPUs on database servers not fully utilized, unless there is some heavy ordering operation or a bad query. That behavior is also (not only) a direct consequence of the connection-to-thread model, which implies a period of latency and stand by. In Aurora, the incoming connections are not following the same model. Instead, the pool redistributes the load of the incoming connection to a pool of threads, optimizing the latency period, resulting in a higher CPU utilization. Which is what you want from your resource: to be utilized and not waiting for something else to do its job.


The results

Without wasting more electronic ink, let see what comes out of this round of tests (not the final one by the way). To simplify the results, I will also report the graphs from the first set of tests, but will focus on the latest.Small Boxes = SB, Large Boxes LB.

Small Boxes = SB, Large Boxes = LB.

First Test: IIBench

As declared previously, my scope was to verify how the two platforms would have reacted to a simple load focus on inserts with a basic single table. The bufferpool was saturated before running the test.



As we can see, in the presence of a hot spot the solution using Percona XtraDB Cluster outperformed Aurora, in both cases. What is notable, though, is that while XtraDB Cluster remained approximately around the same time/performance, Aurora is significantly reduced the time taken. This shows that Aurora was taking advantage of the more powerful platform, while XtraDB Cluster was not able to.

With further analyzation of the details, we notice that Aurora performs better atomically. It was able to manage more writes/second as well as rows and pages managed. But it was inconsistent: Aurora had performance hiccups at regular intervals. As such the final result was that it took more time to process the whole workload.

I was not able to dig to deeply, given some metrics are not fully available in Aurora. As such I had to rely fully on Aurora engineers, who mentioned to me that hot-spot contention was a possible issue.

Aurora Handler calls:

XtraDB Cluster Handlers:

The execution in XtraDB Cluster showed fewer calls but constant performance, while Aurora has hiccups.

Aurora page activity write:

XtraDB Cluster page activity write:

The trend shown by the handlers stayed consistent in the page management and rows insert, as expected.

Second Test: Application Ingest

As mentioned, this test showed many threads from different application servers, inserted by a batch of 50 statements against multiple tables.

The results coming from this test are quite favorable to Aurora, as we can see starting from the time taken to complete the same workload:



With small ones, the situation was inverted.

But here is where the interesting part starts.

Aurora can manage significantly higher numbers of rows, as the picture below shows:

The results are also constant, and don’t decrease significantly like the inserts with XtraDB Cluster.

The number of handler commits, however, are significantly less.

Once more they stay the same with the load increase, without impacting performance.

Reviewing all handler calls, we get our first surprise.

XtraDB Cluster handler calls:

Aurora handler calls:

The gap/drop existing in the two graphs are the different tests (with an increasing number of threads).

Two things to notice here: the first one is that XtraDB Cluster decreases in performance while processing the load, while Aurora does not. The second (you need to zoom the image) is the number of commits is floating in XtraDB Cluster, while it stays fixed in Aurora.

An even bigger surprise comes up when reviewing the connections graphs.

As expected, XtraDB Cluster has all my connections open, and the number of threads running is quite close to the number of connected threads.

Both of them follow the increasing number of connected threads.

But this is not the case in Aurora.

Also, if my applications are trying to open ~800 threads, the Aurora node see only a part of them, and the number of running is fixed to 32 threads.

The important things to consider here are that a) my applications don’t connect directly to the Aurora instance, but to a connector (MariaDB), and b) that Aurora, in this case, caps the number of running threads to the number of CPU available on the instance (here 32).

Given that, I expected to have worse performance (but I don’t). The fact that Aurora uses one thread for multiple connections seems to be working quite efficiently.

The number of rows inserted is also consistent with the handler calls, and has better performance than XtraDB Cluster.

Aurora rows inserted:

XtraDB Cluster rows inserted

Again we have the same trend, only, this time, Aurora performs better than XtraDB Cluster.

Third Test: OLTP Application

When run on the small boxes, this test saw XtraDB Cluster performing much better than Aurora. The time taken by Aurora was ~3 times the time taken by XtraDB Cluster.

With a large box, I had the inverse result: Aurora is outperforming XtraDB Cluster from 2 to 7 times the speed.

Analyzing the number of commands executed with the increasing workload, we can see how XtraDB Cluster can perform better than Aurora with a workload of 128 threads, but starts to have worse performance as the load increases.

On the other hand, Aurora manages the read/write load without significant performance loss, which includes being able to increase the number of commits/sec.

Reviewing the handler calls, we see that the handler commit calls are significantly less in Aurora (as already noticed in the ingest tests).

Another thing to note is that the number of calls for XtraDB Cluster is significantly higher and not scaling, while Aurora has a nice scaling trend.

Fourth Test: TPCC-mysql

The TPCC test is mainly to test OLTP traffic, with the note that some tables (like district) might become a hotspot. The tests I ran were executed against 400 warehouses, and used 128 threads maximum for the small box and 2048 threads for the large box.

During this test, I hit one of the Aurora limitations and I escalated it to the Aurora engineers (who are aware of the problem).

Small boxes:

In the case of small boxes, there is nothing to say: XtraDB Cluster manages the load more efficiently. This trend is not optimal, having significant fluctuation. Aurora is just not able to keep it up.

Large boxes:


It is a different and a more complex scenario in the case of the use of large boxes. I would like to say that Aurora performs better.

This is true for two of the three tests, and up to when it got stuck by internal limitation Aurora was also performing better on the third. But then its performance just collapsed.

With a more in-depth investigation, I noticed that under the hood Aurora was not performing as well as it appeared. This comes out quite clearly by looking at a comparison between the graphs covering Comm_ execution, open files, handlers and InnoDBrow lock time.

In all of them it is evident how XtraDB Cluster keeps serving the workload with consistent behavior, while Aurora fails the second test on (512 threads) — not just on the third with 2048 threads.


XtraDB Cluster:

It is clear that Aurora was better served during the test with 256 threads going over the 450K com select serve (in 10 sec interval), compared with XtraDB Cluster that was not able to go over 350K.

But in the following tests, while XtraDB Cluster was able to keep going (with decreasing performance), Aurora started to struggle with very inconsistent behavior.

This was also confirmed by the open files graph.


XtraDB Cluster:

The graphs show the instances of files open during the test, not the ones already open. It reflects the Open_file metric “The number of files that are open. This count includes regular files opened by the server. It does not include other types of files such as sockets or pipes. Also, the count does not include files that storage engines open using their own internal functions rather than asking the server level to do so.”

I was quite surprised by the number of files open by Aurora.

Handlers reflected the same behavior, as well.


XtraDB Cluster:

Perfectly in line with the com trend.

So what was increasing in reverse?


XtraDB Cluster:

As you can see from the above, the exactly same workload generated an increasing lock row time, from quite low in the test with 256 threads, up to a crazy high with 2048 threads.

As mentioned, we know that TPCC has a couple of tables that act as hotspots, and we already saw with IIbench how Aurora is not working efficiently in that case.

I also was getting a lot of 188 errors during the test. This is an Aurora internal error. When I reported it, I was told they know about it, and they are planning to work on it.

I hope they do soon, because if this issue is solved it is very likely that Aurora will not only be able to manage the tested workload, but exceed it by far.

I am saying this because also with the identified issues Aurora was able to keep going and manage a more than decent response time during the second test (with 512 threads).

Fifth Test: Sysbench

I added the sysbench tests to test scalability, and to see the what happens when the system reaches a saturation point. This test brought up some limitations existing in the Aurora solution, related more to the connector than the Aurora engine itself.

Aurora has a limit of 16k connections. I wanted to see what happens if I got to saturation point or close to it. It doesn’t matter if this is a ridiculously high number or not.

What happened was that Aurora managed traffic up to 4K. The closer I got to the limit, however, the more I had a connectivity issue. At the end I had to run the test with 8K, 12K and 20K threads pointing directly to the Aurora instance, bypassing the connector that was not able to serve the traffic. After that, I was able to hit up to ~15500 threads (but with a lot of inconsistent performance). I am defining the limit of a meaningful test from the previous level of 12K threads.

XtraDB Cluster was able to scale up to 16K no problem.

What also is notable here is that Aurora was able to manage the workload more efficiently regarding transaction handling (i.e., as transactions executed and latency).

The number of transactions executed by Aurora was ~three times the one executed by XtraDB Cluster.

Regarding latency, Aurora showed less latency then XtraDB Cluster.

Internally, Aurora and XtraDB Cluster operations were once again different regarding how the workload was handled. The most divergent result was the handler calls:

Commit calls in Aurora were a fraction of the calls in XtraDB Cluster, while the number of rollbacks was higher.

The read calls had an even more divergent behavior, with XtraDB Cluster performing a higher number of read_keys, while Aurora was having a very limited number of them. Read_rnd are very high in XtraDB Cluster, but totally absent in Aurora (note that in Aurora, read_rnds are reported but seem not to increase). On the other hand, Aurora reported a high number of read_rnd_next, while XtraDB Cluster has none.

HA availability Fail-over time

Both solutions:

In this test, the fail-over time for the solution using Galera and HAProxy was more efficient. For both a limited or mid-level load. One assumption is that given Aurora has to verify both the status of the data transmitted and its consistency across the six data store nodes in every case; the process is not as fast as it could be.

It could also be that the cluster connector is not as efficient as it should in redirecting the traffic from one node to another. It would be a very interesting exercise to replace it with some other custom solution.

Note that I was performing the tests following the Amazon recommendation to use the following to simulate a real crash:


As such, I was not doing anything strange or out of the ordinary.

It is worth mentioning that of the eight seconds taken by MySQL/Galera to perform the failover, six were due to the HAProxy settings (which had a 3000 ms interval and two loops in the settings before executing failover).

Execution latency

The purpose of these tests was to identify the latency existing between the moment that application sends the request and the moment MySQL/Aurora took the request in “charge”. The expectation is that the busier the database, the higher the latency.

For this test, I reported both results: the one coming from the old tests with the small box, and the new one with the large box.

Small boxes:

Large boxes:

It is clear from the graphs that the two tests report different scenarios. In the first, Galera was able to manage the load more efficiently and serve requests with lower latency. For the new tests, I had used a higher number of threads than the ones for the small box. Nevertheless, in the second test the CPU utilization and the number of running threads lead me to think that Aurora was finally able to utilize resources more efficiently and the lower latency.

The latency jumped up again when the number of connections rose above 12K, but that was expected given previous tests results.

Conclusions High Availability

The two platforms were able to manage the failover operation in a limited time frame (below 1 minute). Nevertheless, MySQL/Galera was shown to be more efficient and consistent. This result is a direct consequence of synchronous replication, which by design prevents MySQL/Galera from allowing an active node to fall behind.

In my opinion, the replication method used in Aurora is efficient, and given that data is shared across the read replicas, fail-over should happen faster.

The tests suffered because of the connector, and I have the feeling that having another solution in place may bring some surprises (actually, I would like to test that as well).


In this run of tests, Aurora was able to invert the results I had in the first test with the small boxes. In almost all cases, Aurora performed as well or better then XtraDB Cluster. There are still cases where Aurora is penalized, and those are the ones where hotspots are present. The contention in Aurora is killing performance, and raise errors (188). But I hope we will see a significant evolution soon.

General Comments on Aurora

The product is evolving quickly, and benchmark results may become obsolete in very short time (this is why it is important to have repeatable and comparable tests). From my point of view, in this set of tests Aurora clearly shows where it’s a better fit: higher-end levels, where high availability and CPU power is the focus (not concerns about the cost).

There is no reason to use Aurora in small-mid boxes: the platform is not going to be as efficient as a standard solution like XtraDB Cluster. But if cost is not an issue, and the applications require a lot of parallelism, Aurora on db.r3.8xlarge is a good solution.

I still see space for improvements (like for cluster connectors, or the time taken to restart a cluster after a full stop, or contention reduction). But I am also confident that the work led by the development team will fix most of my concerns (and more) soon.

Final note: it would be nice to have the code open source, so that the community could contribute (but I understand the business reasons not to).

About Cost

I don’t think it is this the right place to mention the cost of each solution (especially because each need is different).

As such, I am not reporting any specific numbers. You can, however, follow the links below and do the necessary math:

Aurora cost calculator

AWS cost calculator


Categories: MySQL

MongoDB 3.2: elections just got better!

MySQL Performance Blog - Wed, 2016-05-25 19:26

In this blog, we’ll review MongoDB 3.2 elections and how they work, as well as what is really new and different in the election protocol.

MongoDB 3.2 revamped its election protocol for increased stability! Exciting times, with smarter and faster elections are here! With this latest release, you will find that replication (and the election protocol) have been improved. Some of the changes include:

  • The addition of electionTimeoutMS
  • WriteConcern  now implies “j:true”
    • Old j:true meant just the primary node
    • New j:true means all involved nodes must ACK the journal
    • j:true means your journal MS will be thirded, and synchronization occurs every 10ms (MMAP) or 33ms (WiredTiger) by default
  • Optime in rs.status now an Object, not a Timestamp

You’ll need to enable the Election Protocol when upgrading MongoDB from an earlier version, while new replSets get it enabled by default.

Election Protocol: what is an election?

Mongo uses a consensus protocol. This means that all nodes must agree who is the most current when handing:

  • Hardware failure
  • Network split
  • Time shifts

New updates allow for faster elections using an (term) electionId to prevent timeout between separate voting rounds. This guarantees there aren’t double (and conflicting) votes while also reducing the time to wait to know a vote completed.

How does it do it?

Elections now have “term” or “vote” identifiers (ID). Terms are used to separate voting rounds. Every vote attempt increments the ID. The ID incrementation prevents a node from double voting in the same term, and makes it easier for nodes to know if a re-vote is needed where before it could be up to 5 minutes!

The protocol timeouts have some new features and behaviors:

  • Now configurable
  • Randomness added to each node
  • Less chance all node timeout at the same time
Normal election process

Below I’m going to walk you through a typical replica set operation. The configuration looks like the following:

In this topology:

  • There are three members
  • All of them are heartbeating to each other
  • There is no arbiter, so you get full high availability (HA)

The following diagram provides a more detailed picture of the interactions:

Notice how replication pulls from the primary to each secondary from the primary – the secondary does all the work. A heartbeat is still shared by all the nodes.

Now let’s see what happens when our primary crashes. It just did!

Nodes will still try to heartbeat to it until two have failed in a short period.

After the failure, things happen quickly.

  1. Secondaries give up on heartbeats
  2. They then vote with each other on who is newest in oplog
  3. If they have > 50% of total voting population they select a new winner

A new Primary is selected, and the heartbeat system is cleaned up.

Replication now gets restarted. If the fatal node comes back online, it’s treated as a secondary once it “catches up” via the oplog.

Stepdown Election Process

The stepdown election process is the same as above, with the following caveats:

  • It’s MUCH faster, as the existing primary “starts” an election
  • There is less chance of the old primary not having data replicated
  • It kills writes while doing election
  • The election process doesn’t wait for heartbeat timeouts

Generally speaking, you should always try to use the stepdown election process. Timeouts are for crashes and failures, not general use.


Categories: MySQL

Percona Server 5.6.30-76.3 is now available

MySQL Performance Blog - Wed, 2016-05-25 16:54

is glad to announce the release of Percona Server 5.6.30-76.3 on May 25, 2016. Download the latest version from the Percona web site or the Percona Software Repositories.

Based on MySQL 5.6.30, including all the bug fixes in it, Percona Server 5.6.30-76.3 is the current GA release in the Percona Server 5.6 series. Percona Server is open-source and free – this is the latest release of our enhanced, drop-in replacement for MySQL. Complete details of this release can be found in the 5.6.30-76.3 milestone on Launchpad.

Bugs Fixed:

  • When Read Free Replication was enabled for TokuDB, and there was no explicit primary key for the replicated TokuDB table, there could be duplicated records in the table on update operation. The fix disables Read Free Replication for tables without an explicit primary key and does rows lookup for UPDATE and DELETE binary log events and issues warning. Bug fixed #1536663 (#950).
  • Attempting to execute a non-existing prepared statement with Response Time Distribution plugin enabled could lead to a server crash. Bug fixed #1538019.
  • TokuDB was using using different memory allocators; this was causing safemalloc warnings in debug builds and crashes because memory accounting didn’t add up. Bug fixed #1546538 (#962).
  • Fixed heap allocator/deallocator mismatch in Metrics for scalability measurement. Bug fixed #1581051.
  • Percona Server is now built with system zlib library instead of the older bundled one. Bug fixed #1108016.
  • Reduced the memory overhead per page in the InnoDB buffer pool. The fix was based on Facebook patch #91e979e. Bug fixed #1536693 (upstream #72466).
  • CREATE TABLE ... LIKE ... could create a system table with an unsupported enforced engine. Bug fixed #1540338.
  • Change buffer merge could throttle to 5% of I/O capacity on an idle server. Bug fixed #1547525.
  • Slave_open_temp_tables would fail to decrement on the slave with a disabled binary log if the master was killed. Bug fixed #1567361.
  • The server will now show a more descriptive error message when Percona Server fails with errno == 22 "Invalid argument", if innodb_flush_method was set to ALL_O_DIRECT. Bug fixed #1578604.
  • Killed connection threads could get their sockets closed twice on shutdown. Bug fixed #1580227.
  • AddressSanitizer build with LeakSanitizer enabled was failing at gen_lex_hash invocation. Bug fixed #1580993 (upstream #80014).
  • apt-cache show command for percona-server-client was showing innotop included as part of the package. Bug fixed #1201074.
  • mysql-systemd would fail with PAM authentication and proxies due to a regression introduced when fixing #1534825 in Percona Server 5.6.29-76.2. Bug fixed #1558312.
  • Upgrade logic for figuring if TokuDB upgrade can be performed from the version on disk to the current version was broken due to a regression introduced when fixing bug #684 in Percona Server 5.6.27-75.0. Bug fixed #717.
  • If ALTER TABLE was run while tokudb_auto_analyze variable was enabled it would trigger auto-analysis, which could lead to a server crash if ALTER TABLE DROP KEY was used because it would be operating on the old table/key meta-data. Bug fixed #945.
  • The TokuDB storage engine with tokudb_pk_insert_mode set to 1 is safe to use in all conditions. On INSERT IGNORE or REPLACE INTO, it tests to see if triggers exist on the table, or replication is active with !BINLOG_FORMAT_STMT before it allows the optimization. If either of these conditions is met, then it falls back to the “safe” operation of looking up the target row first. Bug fixed #952.
  • Bug in TokuDB Index Condition Pushdown was causing ORDER BY DESC to reverse the scan outside of the WHERE bounds. This would cause a query to hang in a sending data state for several minutes in some environments with large amounts of data (3 billion records) if the ORDER BY DESC statement was used. Bugs fixed #988, #233, and #534.

Other bugs fixed: #1399562 (upstream #75112), #1510564 (upstream #78981), #1496282 (#964), #1496786 (#956), #1566790, #1552673, #1567247, #1567869, #718, #914, #970, #971, #972, #976, #977, #981, #637, and #982.

Release notes for Percona Server 5.6.30-76.3 are available in the online documentation. Please report any bugs on the launchpad bug tracker.

Categories: MySQL

Looking inside the MySQL 5.7 document store

MySQL Performance Blog - Tue, 2016-05-24 22:36

In this blog, we’ll look at the MySQL 5.7 document store feature, and how it is implemented.

Document Store

MySQL 5.7.12 is a major new release, as it contains quite a number of new features:

  1. Document store and “MongoDB” like NoSQL interface to JSON storage
  2. Protocol X / X Plugin, which can be used for asynchronous queries (I will write about it as well)
  3. New MySQL shell

Peter already wrote the document store overview; in this post, I will look deeper into the document store implementation. In my next post, I will demonstrate how to use document store for Internet of Things (IoT) and event logging.

Older MySQL 5.7 versions already have a JSON data type, and an ability to create virtual columns that can be indexed. The new document store feature is based on the JSON datatype.

So what is the document store anyway? It is an add-on to a normal MySQL table with a JSON field. Let’s take a deep dive into it and see how it works.

First of all: one can interface with the document store’s collections using the X Plugin (default port: 33060). To do that:

  1. Enable X Plugin and install MySQL shell.
  2. Login to a shell:
    mysqlsh --uri root@localhost
  3. Run commands (JavaScript mode, can be switched to SQL or Python):
    mysqlsh --uri root@localhost Creating an X Session to root@localhost:33060 Enter password: No default schema selected. Welcome to MySQL Shell 1.0.3 Development Preview Copyright (c) 2016, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help', 'h' or '?' for help. Currently in JavaScript mode. Use sql to switch to SQL mode and execute queries. mysql-js> db = session.getSchema('world_x') <Schema:world_x> mysql-js> db.getCollections() { "CountryInfo": <Collection:CountryInfo> }

Now, how is the document store’s collection different from a normal table? To find out, I’ve connected to a normal MySQL shell:

mysql world_x Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or g. Your MySQL connection id is 2396 Server version: 5.7.12 MySQL Community Server (GPL) Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or 'h' for help. Type 'c' to clear the current input statement. mysql> show create table CountryInfo *************************** 1. row *************************** Table: CountryInfo Create Table: CREATE TABLE `CountryInfo` ( `doc` json DEFAULT NULL, `_id` varchar(32) GENERATED ALWAYS AS (json_unquote(json_extract(`doc`,'$._id'))) STORED NOT NULL, PRIMARY KEY (`_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 1 row in set (0.00 sec) mysql> show tables; +-------------------+ | Tables_in_world_x | +-------------------+ | City | | Country | | CountryInfo | | CountryLanguage | +-------------------+ 4 rows in set (0.00 sec)

So the document store is actually an InnoDB table with one field: doc json + Primary key, which is a generated column.

As we can also see, there are four tables in the world_x database, but db.getCollections() only shows one. So how does MySQL distinguish between a “normal” table and a “document store” table? To find out, we can enable the general query log and see which query is being executed:

$ mysql -e 'set global general_log=1' $ tail /var/log/general.log 2016-05-17T20:53:12.772114Z 186 Query SELECT table_name, COUNT(table_name) c FROM information_schema.columns WHERE ((column_name = 'doc' and data_type = 'json') OR (column_name = '_id' and generation_expression = 'json_unquote(json_extract(`doc`,''$._id''))')) AND table_schema = 'world_x' GROUP BY table_name HAVING c = 2 2016-05-17T20:53:12.773834Z 186 Query SHOW FULL TABLES FROM `world_x`

As you can see, every table that has a specific structure (doc JSON or specific generation_expression) is considered to be a JSON store. Now, how does MySQL translate the .find or .add constructs to actual MySQL queries? Let’s run a sample query:

mysql-js> db.getCollection("CountryInfo").find('Name= "United States"').limit(1) [ { "GNP": 8510700, "IndepYear": 1776, "Name": "United States", "_id": "USA", "demographics": { "LifeExpectancy": 77.0999984741211, "Population": 278357000 }, "geography": { "Continent": "North America", "Region": "North America", "SurfaceArea": 9363520 }, "government": { "GovernmentForm": "Federal Republic", "HeadOfState": "George W. Bush", "HeadOfState_title": "President" } } ] 1 document in set (0.02 sec)

and now look at the slow query log again:

2016-05-17T21:02:21.213899Z 186 Query SELECT doc FROM `world_x`.`CountryInfo` WHERE (JSON_EXTRACT(doc,'$.Name') = 'United States') LIMIT 1

We can verify that MySQL translates all document store commands to SQL. That also means that it is 100% transparent to the existing MySQL storage level and will work with other storage engines. Let’s verify that, just for fun:

mysql> alter table CountryInfo engine=MyISAM; Query OK, 239 rows affected (0.06 sec) Records: 239 Duplicates: 0 Warnings: 0 mysql-js> db.getCollection("CountryInfo").find('Name= "United States"').limit(1) [ { "GNP": 8510700, "IndepYear": 1776, "Name": "United States", "_id": "USA", "demographics": { "LifeExpectancy": 77.0999984741211, "Population": 278357000 }, "geography": { "Continent": "North America", "Region": "North America", "SurfaceArea": 9363520 }, "government": { "GovernmentForm": "Federal Republic", "HeadOfState": "George W. Bush", "HeadOfState_title": "President" } } ] 1 document in set (0.00 sec) 2016-05-17T21:09:21.074726Z 2399 Query alter table CountryInfo engine=MyISAM 2016-05-17T21:09:41.037575Z 2399 Quit 2016-05-17T21:09:43.014209Z 186 Query SELECT doc FROM `world_x`.`CountryInfo` WHERE (JSON_EXTRACT(doc,'$.Name') = 'United States') LIMIT 1

Worked fine!

Now, how about the performance? We can simply take the SQL query and run explain:

mysql> explain SELECT doc FROM `world_x`.`CountryInfo` WHERE (JSON_EXTRACT(doc,'$.Name') = 'United States') LIMIT 1 *************************** 1. row *************************** id: 1 select_type: SIMPLE table: CountryInfo partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 239 filtered: 100.00 Extra: Using where 1 row in set, 1 warning (0.00 sec)

Hmm, it looks like it is not using an index. That’s because there is no index on Name. Can we add one? Sure, we can add a virtual column and then index it:

mysql> alter table CountryInfo add column Name varchar(255) -> GENERATED ALWAYS AS (json_unquote(json_extract(`doc`,'$.Name'))) VIRTUAL; Query OK, 0 rows affected (0.12 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> alter table CountryInfo add key (Name); Query OK, 0 rows affected (0.02 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> explain SELECT doc FROM `world_x`.`CountryInfo` WHERE (JSON_EXTRACT(doc,'$.Name') = 'United States') LIMIT 1 *************************** 1. row *************************** id: 1 select_type: SIMPLE table: CountryInfo partitions: NULL type: ref possible_keys: name key: name key_len: 768 ref: const rows: 1 filtered: 100.00 Extra: NULL 1 row in set, 1 warning (0.00 sec)

That is really cool! We have added an index, and now the original query starts using it. Note that we do not have to reference the new field, the MySQL optimizer is smart enough to translate the (JSON_EXTRACT(doc,'$.Name') = 'United States' to an index scan on the virtual column.

But please note: JSON attributes are case-sensitive. If you will use (doc,'$.name') instead of (doc,'$.Name') it will not generate an error, but will simply break the search and all queries looking for “Name” will return 0 rows.

Finally, if you looked closely at the output of db.getCollection("CountryInfo").find('Name= "United States"').limit(1) , you noticed that the database has outdated info:

"government": { "GovernmentForm": "Federal Republic", "HeadOfState": "George W. Bush", "HeadOfState_title": "President" }

Let’s change “George W. Bush” to “Barack Obama” using the .modify clause:

mysql-js> db.CountryInfo.modify("Name = 'United States'").set("government.HeadOfState", "Barack Obama" ); Query OK, 1 item affected (0.02 sec) mysql-js> db.CountryInfo.find('Name= "United States"') [ { "GNP": 8510700, "IndepYear": 1776, "Name": "United States", "_id": "USA", "demographics": { "LifeExpectancy": 77.0999984741211, "Population": 278357000 }, "geography": { "Continent": "North America", "Region": "North America", "SurfaceArea": 9363520 }, "government": { "GovernmentForm": "Federal Republic", "HeadOfState": "Barack Obama", "HeadOfState_title": "President" } } ] 1 document in set (0.00 sec)


Document store is an interesting concept and a good add-on on top of the existing MySQL JSON feature. Using the new .find/.add/.modify methods instead of the original SQL statements can be convenient in some cases.

Some might ask, “why do you want to use document store and store information in JSON inside the database if it is relational anyway?” Storing data in JSON can be quite useful in some cases, for example:

  • You already have a JSON (i.e., from external feeds) and need to store it anyway. Using the JSON datatype will be more convenient and more efficient.
  • You have a flexible schema, typical for the Internet of Things for example, where some sensors might only send temperature data, some might send temperature/humidity/light (but light information is only recorded during the day), etc. Storing it in the JSON format can be more convenient so that you do not have to declare all possible fields in advance, and do not have to run “alter table” if a new sensor starts sending new types of data.

In the next two blog posts, I will show how to use document store for Internet of Things / event streaming, and how to use X Protocol for asynchronous queries in MySQL.

Categories: MySQL

pt-online-schema-change (if misused) can’t save the day

MySQL Performance Blog - Tue, 2016-05-24 18:27

In this blog post we’ll discuss pt-online-schema-change, and how to correctly use it.

Always use pt-osc?

Altering large tables can be still a problematic DBA task, even now after we’ve improved Online DDL features in MySQL 5.6 and 5.7. Some ALTER types are still not online, or sometimes just too expensive to execute on busy production master.

So in some cases, we may want to apply an ALTER first on slaves, taking them out of traffic pool one by one and bringing them back after the ALTER is done. In the end, we can promote one of the already altered slaves to be new master, so that the downtime/maintenance time is greatly minimized. The ex-master can be altered later, without affecting production. Of course, this method works best when the schema change is backwards-compatible.

So far so good, but there is another problem. Let’s say the table is huge, and ALTER takes a lot of time on the slave. When it is a DML-blocking type ALTER (perhaps when using MySQL 5.5.x or older, etc.), there will be a long slave lag (if the table is being written by replication SQL thread at the same time, for example). So what do we to speed up the process and avoid the altered slave lag? One temptation that could tempt you is why not use pt-online-schema-change on the slave, which can do the ALTER in a non-blocking fashion?

Let’s see how it that would work. I need to rebuild big table on slave using MySQL version 5.6.16 (“null alter” was made online since 5.6.17) to reclaim disk space after some rows are deleted.

This example demonstrates the process (db1 is the master, db2 is the slave):

[root@db2 ~]# pt-online-schema-change --execute --alter "engine=innodb" D=db1,t=sbtest1 No slaves found.  See --recursion-method if host db2 has slaves. Not checking slave lag because no slaves were found and --check-slave-lag was not specified. Operation, tries, wait: analyze_table, 10, 1 copy_rows, 10, 0.25 create_triggers, 10, 1 drop_triggers, 10, 1 swap_tables, 10, 1 update_foreign_keys, 10, 1 Altering `db1`.`sbtest1`... Creating new table... Created new table db1._sbtest1_new OK. Altering new table... Altered `db1`.`_sbtest1_new` OK. 2016-05-16T10:50:50 Creating triggers... 2016-05-16T10:50:50 Created triggers OK. 2016-05-16T10:50:50 Copying approximately 591840 rows... Copying `db1`.`sbtest1`:  51% 00:28 remain (...)

The tool is still working during the operation, and the table receives some writes on master:

db1 {root} (db1) > update db1.sbtest1 set k=k+2 where id<100; Query OK, 99 rows affected (0.06 sec) Rows matched: 99  Changed: 99  Warnings: 0 db1 {root} (db1) > update db1.sbtest1 set k=k+2 where id<100; Query OK, 99 rows affected (0.05 sec) Rows matched: 99  Changed: 99  Warnings: 0

which are applied on slave right away, as the table allows writes all the time.

(...) Copying `db1`.`sbtest1`:  97% 00:01 remain 2016-05-16T10:51:53 Copied rows OK. 2016-05-16T10:51:53 Analyzing new table... 2016-05-16T10:51:53 Swapping tables... 2016-05-16T10:51:53 Swapped original and new tables OK. 2016-05-16T10:51:53 Dropping old table... 2016-05-16T10:51:53 Dropped old table `db1`.`_sbtest1_old` OK. 2016-05-16T10:51:53 Dropping triggers... 2016-05-16T10:51:53 Dropped triggers OK. Successfully altered `db1`.`sbtest1`.

Done! No slave lag, and the table is rebuilt. But . . . let’s just make sure data is consistent between the master and slave (you can use pt-table-checksum):

db1 {root} (db1) > select max(k) from db1.sbtest1 where id<100; +--------+ | max(k) | +--------+ | 392590 | +--------+ 1 row in set (0.00 sec) db2 {root} (test) > select max(k) from db1.sbtest1 where id<100; +--------+ | max(k) | +--------+ | 392586 | +--------+ 1 row in set (0.00 sec)

No, it is not! The slave is clearly missing the updates that happened during a pt-osc run. Why?

The explanation is simple. The pt-online-schema-change relies on triggers. The triggers are used to make the writes happening to the original table also populate to the temporary table copy, so that both tables are consistent when the final table switch happens at the end of the process. So what is the problem here? It’s the binary log format: in ROW based replication, the triggers are not fired on the slave! And my master is running in ROW mode:

db1 {root} (db1) > show variables like 'binlog_format'; +---------------+-------+ | Variable_name | Value | +---------------+-------+ | binlog_format | ROW | +---------------+-------+ 1 row in set (0.01 sec)

So, if I used pt-online-schema-change on the master, the data inconsistency problem doesn’t happen. But using it on the slave is just dangerous!


Whenever you use pt-online-schema-change, make sure you are not executing it on a slave instance. For that reason, I escalated this bug report: Also in many cases, using a normal ALTER will work well enough. As in my example, to rebuild the table separately on each slave in lockless mode, I would just need to upgrade to the more recent 5.6 version.

BTW, if you’re wondering about Galera replication (used in Percona XtraDB Cluster, etc.) since it also uses a ROW-based format, it’s not a problem. The pt-osc triggers are created in all nodes thanks to synchronous write-anywhere replication nature. It does not matter which node you start pt-online-schema-change on, and which other nodes your applications writes on at the same time. No slaves, no problem!

Categories: MySQL

Webinar Thursday May 26: Troubleshooting MySQL hardware resource usage

MySQL Performance Blog - Tue, 2016-05-24 15:46

Join Sveta on Thursday, May 26, 2016, at 10 am PDT (UTC-7) for her webinar Troubleshooting MySQL hardware resource usage.

MySQL does not just run on its own. It stores data on disk, and stores data and temporarily results in memory. It uses CPU resources to perform operations, and a network to communicate with its clients.

In this webinar, we’ll discuss common resource usage issues, how they affect MySQL Server performance, and methods to find out how resources are being used. We will employ both OS-level tools, and new features in Performance Schema that provide detailed information on what exactly is happening inside MySQL Server.

Register for the webinar here.

Sveta Smirnova, Principal Technical Services Engineer

Sveta joined Percona in 2015.

Her main professional interests are problem-solving, working with tricky issues, bugs, finding patterns that can solve typical issues quicker, teaching others how to deal with MySQL issues, bugs and gotchas effectively. Before joining Percona Sveta worked as Support Engineer in MySQL Bugs Analysis Support Group in MySQL AB-Sun-Oracle.

She is the author of the book MySQL Troubleshooting and JSON UDF Functions for MySQL.

Categories: MySQL

Take Percona’s one-click high availability poll

MySQL Performance Blog - Mon, 2016-05-23 20:47

Wondering what high availability (HA) solutions are most popular? Take our high availability poll below!

HA is always a hot topic. The reality is that if your data is not available, your customers cannot do business with you. In fact, estimates show the average cost of downtime is about $5K per minute. With an average outage taking 40 minutes to correct, you could be looking at a potential cost of $200K if your MySQL instance goes down. Whether your database is on premise, or in public or private clouds, it is critical that your database deployment does not have a potentially devastating single point of failure.

Please take a few seconds and answer the following poll. It will help the community get an idea of how companies are approaching HA in their critical database environments.

If you’re using other solutions or have specific issues, feel free to comment below. We’ll post a follow-up blog with the results!

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.
Categories: MySQL

Percona disabling TLSv1.0 May 31st 2016

MySQL Performance Blog - Mon, 2016-05-23 18:09

As of May 31st, 2016, we will be disabling TLSv1.0 support on,, etc.

This is ahead of the PCI changes that will affect the June 30th 2016 deprecation the TLSv1.0 protocol. (PDF)

What does this mean for you the user?

Based on analysis of our IDS logs, this will affect around 6.32% of requests. As of May 31st, such requests will present an error when trying to negotiate a TLS connection.

Users are advised to update their clients accordingly. SSLabs provides a good test for browsers, though this does not support command line tools. Going forward, we will only support TLSv1.1 and TLSv1.2.

These changes come a little over a year from our previous SSL overhaul, and are part of our ongoing effort to ensure the security of our users.

Thank you for your time. Please leave any questions in the comments section, or email us at security(at)



Categories: MySQL

Percona XtraBackup 2.4.3 is now available

MySQL Performance Blog - Mon, 2016-05-23 13:56

is glad to announce the GA release of Percona XtraBackup 2.4.3 on May 23rd, 2016. Downloads are available from our download site and from apt and yum repositories.

Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, Percona XtraBackup drives down backup costs while providing unique features for MySQL backups

New Features:

  • Percona XtraBackup has implemented new --reencrypt-for-server-id option. Using this option allows users to start the server instance with different server_id from the one the encrypted backup was taken from, like a replication slave or a Galera node. When this option is used, xtrabackup will, as a prepare step, generate a new master key with ID based on the new server_id, store it into keyring file and re-encrypt the tablespace keys inside of tablespace headers.

Bugs Fixed:

  • Running DDL statements on Percona Server 5.7 during the backup process could in some cases lead to failure while preparing the backup. Bug fixed #1555626.
  • MySQL 5.7 can sometimes skip redo logging when creating an index. If such ALTER TABLE is being issued during the backup, the backup would be inconsistent. xtrabackup will now abort with an error message if such ALTER TABLE has been done during the backup. Bug fixed #1582345.
  • .ibd files for remote tablespaces were not copied back to the original location pointed by the .isl files. Bug fixed #1555423.
  • When called with insufficient parameters, like specifying the empty --defaults-file option, Percona XtraBackup could crash. Bug fixed #1566228.
  • The documentation states that the default value for –ftwrl-wait-query-type is all, however it was update. Changed the default value to reflect the documentation. Bug fixed #1566315.
  • When –keyring-file-data option was specified, but no keyring file was found, xtrabackup would create an empty one instead of reporting an error. Bug fixed #1578607.
  • If ALTER INSTANCE ROTATE INNODB MASTER KEY was run at the same time when xtrabackup --backup was bootstrapping it could catch a moment when the key was not written into the keyring file yet and xtrabackup would overwrite the keyring with the old copy of a keyring, so the new key would be lost. Bug fixed #1582601.
  • The output of the --slave-info option was missing an apostrophe. Bug fixed #1573371.

Release notes with all the bugfixes for Percona XtraBackup 2.4.3 are available in our online documentation. Bugs can be reported on the launchpad bug tracker.

Categories: MySQL

Introduction to Troubleshooting Performance – Troubleshooting Slow Queries webinar: Q & A

MySQL Performance Blog - Fri, 2016-05-20 20:50

In this blog, I will provide answers to the Q & A for the Troubleshooting Slow Queries webinar.

First, I want to thank you for attending the April 28 webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: I’ve heard that is a bad idea to use select *; what do you recommend?

A: When I used SELECT * in my slides, I wanted to underline the idea that sometimes you need to select all columns from the table. There is nothing bad about it if you need them. SELECT * is bad when you need only a few columns from the table. In this case, you retrieve more data than needed, which affects performance. Another issue that  SELECT * can cause is if you hard-code the statement into your application, then change table definition; the application could start retrieving columns in wrong order and output (e.g., email instead of billing address). Or even worse, it will try to access a non-existent index in the result set array. The best practice is to explicitly enumerate all columns that your application needs.

Q: I heard that using index_field length will affect the indexing principle during query execution (e.g., one field is varchar and its length is not fixed, some values have short text, some values have long text. at this time). If we use this field as indexing, what happens?

A: I assume you are asking about the ability to create an index with lengths smaller than the column length? They work as follows:

Assume you have a TEXT  field which contains these user questions:

  1. I’ve heard that is a bad idea to use select * what do you recommend?
  2. I heard that using index_field length will affect the indexing principle during query execution (e.g., one field is varchar and its length is not fixed, some values have short text, some values have long text. at this time). If we use this field as indexing, what happens?
  3. ….

Since this is a TEXT  field you cannot create and index on it without specifying its length, so you need to make the index as minimal as possible to uniquely identify questions. If you create an index with length 10 it will contain:

  1. I’ve heard
  2. I heard th

You will index only those parts of questions that are not very distinct from each other, and do not contain useful information about what the question is. You can create index of length 255:

  1. I’ve heard that is a bad idea to use select * what do you recommend?
  2. I heard that using index_field length will affect the indexing principle during query execution (e.g., one field is varchar and its length is not fixed, some values have short text, some values have long text. at this time). If we use this field as index

In this case, the index includes the whole first question and almost all the second question. This makes the index too large and requires us to use more disk space (which causes more IO). Also, information from the second question is probably too much.

If make index of length 75, we will have:

  1. I’ve heard that is a bad idea to use select * what do you recommend?
  2. I heard that using index_field length will affect the indexing principle du

This is more than enough for the first question and gives a good idea of what is in the second question. It also potentially will have enough unique entries to make its cardinality look more like the cardinality of real data distribution.

To conclude: choosing the correct index length is something that requires practice and analysis of your actual data. Try to make them as short as possible, but long enough so that the number of unique entries in the index will be similar to a number of unique entries in the table.

Q: Which view can we query to see stats?

A: Do you mean index statistics? SHOW INDEX FROM table_name will do it.

Q: We have an InnoDB table with 47 fields (mostly text); some are ft-indexed. I tried to do an alter table, and it ran for 24 hours. What is the best way to run an alter table to add one extra field? The table has 1.9 M rows and 47 columns with many indexes.

A: Adding a column requires a table copy. Therefore, the speed of this operation depends on the table size and speed of your disk. If you are using version 5.6 and later, adding a column would not block parallel queries (and therefore is not a big deal). If you are using an older version, you can always use the pt-online-schema-change utility from Percona Toolkit. However, it will run even more slowly than the regular ALTER TABLE. Unfortunately, you cannot speed up the execution of ALTER TABLE much. The only thing that you can do is to use a faster disk (with options, tuned to explore speed of the disk).

However, if you do not want to have this increased IO affect the production server, you can alter the table on the separate instance, then copy tablespace to production and then apply all changes to the original table from the binary logs. The steps will be something like:

  1. Ensure you use option innodb_file_per_table  and the big table has individual tablespace
  2. Ensure that binary log is enabled
  3. Start a new server (you can also use an existent stand-by slave).
  4. Disable writes to the table
  5. Record the binary log position
  6. Copy the tablespace to the new server as described here.
  7. Enable writes on the production server
  8. Run ALTER TABLE on the new server you created in step 2 (it will still take 24 hours)
  9. Stop writes to the table on the production server
  10. Copy the tablespace, altered in step 7
  11. Apply all writes to this table, which are in the binary logs after position, recorded in step 4.
  12. Enable writes to the table

This scenario will take even more time overall, but will have minimal impact on the production server

Q: If there is a compound index like index1(emp_id,date), will the following query be able to use index? “select * from table1 where emp_id = 10”

A: Yes. At least it should.

Q: Are filesort and temporary in extended info for explain not good?

A: Regarding filesort: it depends. For example, you will always have the word filesort” for tables which perform ORDER BY and cannot use an index for ORDER BY. This is not always bad. For example, in this query:

mysql> explain select emp_no, first_name from employees where emp_no <20000 order by first_nameG *************************** 1. row *************************** id: 1 select_type: SIMPLE table: employees partitions: NULL type: range possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: NULL rows: 18722 filtered: 100.00 Extra: Using where; Using filesort 1 row in set, 1 warning (0,01 sec)

the primary key used to resolve rows and filesort were necessary and not avoidable. You can read about different filesort algorithms here.

Regarding Using temporary: this means what during query execution temporary table will be created. This is can be not good, especially if the temporary table is large and cannot fit into memory. In this case, it would be written to disk and slow down operations. But, again, sometimes creating temporary tables in not avoidable, for example, if you have both GROUP BY and ORDER BY clauses which list columns differently as stated in the user manual.

Q: Is key_len length more of a good thing for query execution?

A: key_len field is not NULL for all queries that use and index, and just shows the length of the key part used. It is not good or bad, it is just for information. You can use this information, for example, to identify which part of combined index is used to resolve the query.

Q: Does an alter query go for an optimizer check?

A: No. You can check it either by enabling optimizer trace, running ALTER and find what trace is empty. Or by enabling the debug option and searching the resulting trace for optimize.

Q: A query involves four columns that are all individually covered by an index. The optimizer didn’t merge indexes because of cost, and even didn’t choose the composite index I created.

A: This depends on the table definition and query you used. I cannot provide a more detailed answer based only on this information.

Q cont.: Finally, only certain composite indexes were suitable, the column order in the complex index mattered a lot. Why couldn’t the optimizer merge the four individual single column indexes, and why did the order of the columns in the composite index matter?

A: Regarding why the optimizer could not merge four indexes, I need to see how the table is defined and which data is in these indexed columns. Regarding why the order of the columns in the composite index matters, it depends on the query. Why the optimizer can use an index, say, on (col1, col2) where the conditions col1=X AND col2=Y and col2=Y AND col2=X for the case when you use OR, the order is important. For example, for the condition col1=X OR col2=Y, where the part col1=X is always executed and the part col2=Y  is executed only when col1=X is false. The same logic applies to queries like SELECT col1 WHERE col2=Y ORDER BY col3. See the user manual for details.

Q: When I try to obtain the optimizer trace on the console, the result is cut off. Even if I redirect the output to a file, how to overcome that?

A: Which version of MySQL Server do you use? The TRACE column is defined as longtext NOT NULL, and should not cause such issues. If it does with a newer version, report a bug at

Q: Are there any free graphical visualizers for either EXPLAIN or the optimizer TRACE output?

A: There is graphical visualizer for EXPLAIN in MySQL Workbench. But it works with online data only: you cannot run it on EXPLAIN output, saved into a file. I don’t know about any visualizer for the optimizer TRACE output. However, since it is JSON you can simply save it to file and open in web browser. It will allow a better view than if opened in simple text editor.

Q: When do you use force index instead of use index hints?

A: According to user manual “USE INDEX (index_list) hint tells MySQL to use only one of the named indexes to find rows in the table” and “FORCE INDEX  hint acts like USE INDEX (index_list), with the addition that a table scan is assumed to be very expensive . . . a table scan is used only if there is no way to use one of the named indexes to find rows in the table.” This means that when you use USE INDEX, you are giving a hint for the optimizer to prefer a particular index to others, but not enforcing index usage if the optimizer prefers a table scan, while FORCE INDEX requires using the index. I myself use only FORCE and IGNORE  index hints.

Q: Very informative session. I missed the beginning part. Are you going to distribute the recoded session later?

A: Yes. As usual slides and recording available here.

Categories: MySQL

Percona XtraDB Cluster 5.6.29-25.15 is now available

MySQL Performance Blog - Fri, 2016-05-20 14:06

is glad to announce the new release of Percona XtraDB Cluster 5.6 on May 20, 2016. Binaries are available from the downloads area or our software repositories.

Percona XtraDB Cluster 5.6.29-25.15 is now the current release, based on the following:

All of Percona software is open-source and free, and all the details of the release can be found in the 5.6.29-25.15 milestone at Launchpad.

For more information about relevant Codership releases, see this announcement.

Bugs Fixed:

  • Node eviction in the middle of SST now causes the node to shut down properly.
  • After an error during node startup, the state is now marked unsafe only if SST is required.
  • Fixed data inconsistency during multi-insert auto-increment workload on async master with binlog-format=STATEMENTwhen a node begins async slave with wsrep_auto_increment_control=ON.
  • Fixed crash when a prepare statement is aborted (due to a conflict with applier) and then replayed.
  • Removed a special case condition in wsrep_recover() that would not happen under normal conditions.
  • Percona XtraDB Cluster no longer fails during SST, if a node reserves a very large amount of memory for InnoDB buffer pool.
  • If the value of wsrep_cluster_address is not valid, trying to create a slave thread will now generate a warning instead of an error, and the thread will not be created.
  • Fixed error with loading data infile (LDI) into a multi-partitioned table.
  • The wsrep_node_name variable now defaults to host name.
  • Starting mysqld with unknown option now fails with a clear error message, instead of randomly crashing.
  • Optimized the operation of SST and IST when a node fails during startup.
  • The wsrep_desync variable can now be enabled only after a node is synced with cluster. That is, it cannot be set during node bootup configuration).
  • Fixed crash when setting a high flow control limit (fc_limit) and the recv queue fills up.
  • Only the default 16 KB page size (innodb_page_size=16384) is accepted until the relevant upstream bug is fixed by Codership (see All other sizes will report Invalid page size and shut down (the server will not start up).
  • If a node is executing RSU/FTWRL, explicit desync of the node will not happen until the implicit desync action is complete.
  • Fixed multiple bugs in the test suite to improve quality assurance.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Categories: MySQL

Fixing MySQL scalability problems with ProxySQL or thread pool

MySQL Performance Blog - Thu, 2016-05-19 20:58

In this blog post, we’ll discuss fixing MySQL scalability problems using either ProxySQL or thread pool.

In the previous post I showed that even MySQL 5.7 in read-write workloads is not able to maintain throughput. Oracle’s recommendation to play black magic with innodb_thread_concurrency and innodb_spin_wait_delay doesn’t always help. We need a different solution to deal with this scaling problem.

All the conditions are the same as in my previous run, but I will use:

  • ProxySQL limited to 200 connections to MySQL. ProxySQL has a capability to multiplex incoming connections; with this setting, even with 1000 connections to the proxy it will maintain only 200 connections to MySQL.
  • Percona Server with enabled thread pool, and a thread pool size of 64

You can see final results here:

There are good and bad sides for both solutions. With ProxySQL, there is a visible overhead on lower numbers of threads, but it keeps very stable throughput after 200 threads.

With Percona Server thread pool, there is little-to-no overhead if the number of connections is less than thread pool size, but after 200 threads it falls behind ProxySQL.

There is chart with response times

I would say the correct solution depends on your setup:

  • If you already use or plan to use ProxySQL, you may use it to prevent MySQL from saturation
  • If you use Percona Server, you might consider trying to adjust the thread pool



Categories: MySQL

Webinar Tuesday, May 24: Understanding how your MongoDB schema affects scaling, and when to consider sharding for help

MySQL Performance Blog - Thu, 2016-05-19 20:02

Please join David Murphy on Tuesday, May 24 at 10 am PDT (UTC-7) as he presents “Understanding how your MongoDB schema affects scaling, and when to consider sharding for help.”

David will discuss the pros and cons of a few MongoDB schema design patterns on a stand-alone machine, and then will look at how sharding affects them.  He’ll examine what assumptions did you make that could cause havoc on your CPU, memory and network during a scatter gather.   This webinar will help answer the questions:

  • Would you still use the same schema if you knew you were going to shard?
  • Are your fetches using the same shard, or employing parallelism to boost performance?
  • Are you following the golden rules of schema design?

Register for this webinar here.

David Murphy, MongoDB Practice Manager

David joined Percona in October 2015 as Practice Manager for MongoDB. Before that, David joined the ObjectRocket by Rackspace team as the Lead DBA in Sept 2013. With the growth involved with any recently acquired startup, David’s role covered a wide range of evangelism, research, run book development, knowledge base design, consulting, technical account management, mentoring and much more. Before the world of MongoDB, David was a MySQL and NoSQL architect at Electronic Arts working with some of the largest titles in the world like FIFA, SimCity, and Battle Field providing tuning, design, and technology choice responsibilities. David maintains an active interest in database speaking and exploring new technologies.

Categories: MySQL
Syndicate content