MySQL

ProxySQL 1.2.1 GA Release

MySQL Performance Blog - Thu, 2016-08-18 17:55

The GA release of ProxySQL 1.2.1 is available. You can get it from https://github.com/sysown/proxysql/releases. There are also Docker images for Release 1.2.1: https://hub.docker.com/r/percona/proxysql/.

ProxySQL is a high-performance proxy, currently for MySQL and its forks (like Percona Server and MariaDB). It acts as an intermediary for client requests seeking resources from the database. ProxySQL was created for DBAs by René Cannaò, as a means of solving complex replication topology issues.

This post is published with René’s approval. René is busy implementing more new ProxySQL features, so I decided to make this announcement!

Release highlights:
  • Support for backend SSL connections
  • Support for encrypted passwords (the mysql_users table now supports both plain text passwords and hashed passwords, in the same format as mysql.user.password)
  • Improved monitoring module
  • Better integration with Percona XtraDB Cluster
  • New feature: the Scheduler, which allows extending ProxySQL with external scripts

The last point is especially important in conjunction with our recent Percona XtraDB Cluster 5.7 RC1 release. When we ship Percona XtraDB Cluster 5.7 GA, we plan to make ProxySQL the default proxy solution choice for Percona XtraDB Cluster. ProxySQL is aware of the cluster and node status, and can direct traffic appropriately.

ProxySQL 1.2.1 comes with additional scripts to support this integration.

ProxySQL 1.2.1 and these scripts are compatible with existing Percona XtraDB Cluster 5.6 GA releases.

ProxySQL 1.2.1 is a solid release, currently used by many demanding high-performance production workloads – it is already battle tested! Give it a try if you are looking for a proxy solution.

ProxySQL is available under the GPLv3 open source license, which allows unlimited usage in production. There are no plans to change the license!

Categories: MySQL

TokuDB/PerconaFT fragmented data file performance improvements

MySQL Performance Blog - Wed, 2016-08-17 17:05

In this blog post, we’ll discuss how we’ve improved TokuDB and PerconaFT fragmented data file performance.

Through our internal benchmarking and some user reports, we have found that TokuDB/PerconaFT performance can degrade significantly on large data files under long-term, write-heavy use. Using smaller node sizes (one of our performance tuning recommendations when you have faster storage) makes the problem worse. The problem manifests as low CPU utilization, a drop in overall TPS, and high client response times during prolonged checkpointing.

This post explains a little about how PerconaFT structures dictionary files and where the current implementation breaks down. Hopefully, it explains the nature of the issue and how our solution helps address it. It also provides some contrived benchmarks that demonstrate the solution.

PerconaFT map file disk format

NOTE. This post uses the terms index, data file, and dictionary somewhat interchangeably. We will use the PerconaFT term “dictionary” to refer specifically to a PerconaFT key/value data file.

PerconaFT stores every dictionary in its own data file on disk. TokuDB stores each index in a PerconaFT dictionary, plus one additional dictionary per table for some metadata. For example, if you have one TokuDB table with two secondary indices, you would have four data files or dictionaries: one small metadata dictionary for the table, one dictionary for the primary key/index, and one for each secondary index.

Each dictionary file has three major parts:

  • Two headers (yes, two) made up of various bits of metadata, file versions, a checkpoint logical sequence number (CLSN), the offset of this header's block translation table, etc.
  • Two (yes, two, one per header) block translation tables (BTTs) that map block numbers (BNs) to the physical offsets and sizes of the data blocks within the file.
  • Data blocks and holes (unused space). Unlike InnoDB, PerconaFT data blocks (nodes) are variable-sized: anywhere from a few bytes for an empty internal node up to the block size defined when the tree was created (4MB by default if we don’t use compression), depending on the amount of data within that node.

Each dictionary file contains two versions of the header stored on disk, and only one is valid at any given point in time. Since the header structure has a fixed size, we always know their locations: the first is at offset zero, and the other immediately follows it. The currently valid header is the one with the later/larger CLSN.

We write the header and the BTT to disk during a checkpoint or when a dictionary is closed (the only times we do so). The new header overwrites the older header (the one with the older CLSN) on disk. From that moment onward, any disk space used by the previous version of the dictionary (the whole thing, not just the header) that is not also used by the latest version is considered immediately free.
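
To make the mechanics concrete, here is a minimal sketch of the double-header scheme in Scala (hypothetical structures and field names of my choosing, not the actual PerconaFT code):

// Hypothetical sketch of the double-header scheme described above.
case class Header(clsn: Long, bttOffset: Long, bttSize: Long)

// Both headers sit at fixed, known offsets; the valid one has the larger CLSN.
def validHeader(h0: Header, h1: Header): Header =
  if (h0.clsn >= h1.clsn) h0 else h1

// A checkpoint overwrites the older header, so a crash mid-write always
// leaves one consistent header/BTT pair behind on disk.
def headerSlotToOverwrite(h0: Header, h1: Header): Int =
  if (h0.clsn < h1.clsn) 0 else 1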

There is much more magic to how PerconaFT does checkpoints and consistency, but that is out of the scope of this post. Maybe a later post on PerconaFT's sharp checkpoint can dive into this.

The block allocator

The block allocator is the algorithm and container that manages the list of known used blocks and unused holes within an open dictionary file. When a node gets written, it is the block allocator's responsibility to find a suitable location in the file for the node's data. Data is always placed into a new block; it never overwrites an existing block (except for block space reclaimed from blocks that were removed or moved, as recorded during the last checkpoint). Conversely, when a node gets destroyed, it is the block allocator's responsibility to release that used space and turn the old block into a hole. That hole must also be merged with any adjacent holes, so that we record one large hole rather than a series of consecutive smaller holes.
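
To make the merge step concrete, here is an illustrative sketch (again mine, not the actual PerconaFT code) of coalescing adjacent holes after a free:

// Illustrative sketch of hole merging: after freeing a block, adjacent
// holes are coalesced so the allocator tracks one large hole, not many.
case class Hole(offset: Long, size: Long)

def mergeAdjacent(holes: List[Hole]): List[Hole] =
  holes.sortBy(_.offset).foldLeft(List.empty[Hole]) {
    // The previous hole ends exactly where this one starts: coalesce them.
    case (prev :: rest, h) if prev.offset + prev.size == h.offset =>
      Hole(prev.offset, prev.size + h.size) :: rest
    case (acc, h) => h :: acc
  }.reverse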

Fragmentation and large files

The current implementation of the PerconaFT block allocator maintains a simple in-memory array of used blocks for each open dictionary. The used blocks are ordered ascending by their offset in the file. The holes between the blocks are calculated from the offset and size of the two bounding blocks: the hole between two adjacent blocks starts at b[n].offset + b[n].size and has size b[n+1].offset - (b[n].offset + b[n].size).

To find a suitable hole for node data, the current block allocator starts at the first block in the array. It iterates through the blocks looking for a hole between blocks that is large enough to hold the node's data. Once we find a hole, we cut the space needed for the node out of the hole, and the remainder is left as a hole for another block to possibly use later.
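
As a rough illustration (the real allocator is C++ inside PerconaFT), the linear first-fit search looks something like this:

// Illustrative sketch of the linear first-fit allocator described above.
// Blocks are kept sorted ascending by file offset; holes are implicit.
case class Block(offset: Long, size: Long)

// Returns the offset of the first hole large enough for `need` bytes,
// scanning adjacent block pairs in O(n) using the hole formula above.
def firstFit(blocks: Seq[Block], need: Long): Option[Long] =
  blocks.sliding(2).collectFirst {
    case Seq(a, b) if b.offset - (a.offset + a.size) >= need =>
      a.offset + a.size
  }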

Note. Forcing alignment to 512-byte offsets for direct I/O has overhead, regardless of whether direct I/O is actually used.

This linear search severely degrades PerconaFT performance for very large and fragmented dictionary files. We have solid evidence from the field that this does occur: various profiling tools show a lot of time spent within block_allocator_strategy::first_fit. It is also quite easy to reproduce by using very small node (block) sizes and small fanouts (which force the existence of more nodes, and thus more small holes). This fragmentation can and does cause all sorts of side effects, as the search operation locks the entire structure in memory. It blocks nodes from translating their node/block IDs into file locations.

Let’s fix it…

In this block storage paradigm, fragmentation is inevitable. We can try to dance around and propose different ways to prevent fragmentation (at the expense of higher CPU costs, online/offline operations, etc…). Or, we can look at the way the block allocator works and try to make it more efficient. Attacking the latter of the two options is a better strategy (not to say we aren’t still actively looking into the former).

Tree-based “Max Hole Size” (MHS) lookup

The linear search block allocator has no idea where bigger and smaller holes might be located within the set (a core limitation). It must use brute force to find a hole big enough for the data it needs to store. To address this, we implemented a new in-memory, tree-based algorithm (red-black tree). This replaces the current in-memory linear array and integrates the hole size search into the tree structure itself.

In this new block allocator implementation, we store the set of known in-use blocks within the node structure of a binary tree instead of a linear array. We order the tree by the file offset of the blocks. We then add a little extra data to each node of this new tree structure: the maximum hole we can expect to find in each child subtree. So when searching for a hole, we can quickly drill down the tree to find an available hole of the correct size without performing a fully linear scan. The trade-off is that merging holes together and updating the parental max hole sizes is slightly more intricate and time-consuming than in a linear structure. The huge improvement in search efficiency makes this extra overhead pure noise.

In this overly simplified example, we have five blocks:

  • offset 0 : 1 byte
  • offset 3 : 2 bytes
  • offset 6 : 3 bytes
  • offset 10 : 5 bytes
  • offset 20 : 8 bytes

We can calculate four holes in between those blocks:

  • offset 1 : 2 bytes
  • offset 5 : 1 byte
  • offset 9 : 1 byte
  • offset 15 : 5 bytes

We see that the search for a 4-byte hole traverses down the right side of the tree. It discovers a hole at offset 15, which is big enough for our 4 bytes, without needing to visit the nodes at offsets 0 and 3. For you algorithmic folks out there, we have gone from an O(n) search to an O(log n) search. This is tremendously more efficient when we get into severe fragmentation states. A side effect is that we tend to allocate blocks from holes closer to the needed size rather than from the first one big enough to fit. The small-hole fragmentation issue may actually increase over time, but that has yet to be seen in our testing.
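
Here is a toy sketch of the MHS idea (an unbalanced, simplified stand-in for the real red-black tree, whose fit strategy is also more refined):

// Toy sketch of the MHS lookup: each node records the hole that follows
// its block, plus the largest such hole anywhere in its subtree.
case class TreeNode(offset: Long, size: Long,
                    holeAfter: Long, // gap between this block and the next
                    maxHole: Long,   // max holeAfter over this whole subtree
                    left: Option[TreeNode], right: Option[TreeNode])

// O(log n) on a balanced tree: skip any subtree whose maxHole can't satisfy
// the request, preferring the lowest-offset hole that fits.
def findHole(node: Option[TreeNode], need: Long): Option[Long] =
  node.filter(_.maxHole >= need).flatMap { n =>
    findHole(n.left, need)
      .orElse(if (n.holeAfter >= need) Some(n.offset + n.size) else None)
      .orElse(findHole(n.right, need))
  }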

Benchmarks

As our CTO Vadim Tkachenko asserts, there are “Lies, Damned Lies and Benchmarks.” We're going to show a simple test case where we thought, “What is the worst possible scenario that I can come up with in a small-ish benchmark to show the differences?” So, rather than try to convince you using some pseudo-real-world benchmark that uses sleight of hand, I'm telling you up front that this example is slightly absurd, but it pushes the issue to the foreground.

That scenario is actually pretty simple. We shape the tree to have as many nodes as possible, and intentionally use settings that reduce concurrency. We will use a standard sysbench OLTP test, and run it for about three hours after the prepare stage has completed:

  • Hardware:
    • Intel i7, 4 core hyperthread (8 virtual cores) @ 2.8 GHz
    • 16 GB of memory
    • Samsung 850 Pro SSD
  • Sysbench OLTP:
    • 1 table of 160M rows or about 30GB of primary key data and 4GB secondary key data
    • 24 threads
    • We started each test server instance with no data, ran the sysbench prepare, then the sysbench run, with no shutdown between the prepare and run.
    • prepare command: /data/percona/sysbench/sysbench/sysbench --test=/data/percona/sysbench/sysbench/tests/db/parallel_prepare.lua --mysql-table-engine=tokudb --oltp-tables-count=1 --oltp-table-size=160000000 --mysql-socket=$(PWD)/var/mysql.sock --mysql-user=root --num_threads=1 run
    • run command: /data/percona/sysbench/sysbench/sysbench --test=/data/percona/sysbench/sysbench/tests/db/oltp.lua --mysql-table-engine=tokudb --oltp-tables-count=1 --oltp-table-size=160000000 --rand-init=on --rand-type=uniform --num_threads=24 --report-interval=30 --max-requests=0 --max-time=10800 --percentile=99 --mysql-socket=$(PWD)/var/mysql.sock --mysql-user=root run
  • mysqld/TokuDB configuration
    • innodb_buffer_pool_size=5242880
    • tokudb_directio=on
    • tokudb_empty_scan=disabled
    • tokudb_commit_sync=off
    • tokudb_cache_size=8G
    • tokudb_checkpointing_period=300
    • tokudb_checkpoint_pool_threads=1
    • tokudb_enable_partial_eviction=off
    • tokudb_fsync_log_period=1000
    • tokudb_fanout=8
    • tokudb_block_size=8K
    • tokudb_read_block_size=1K
    • tokudb_row_format=tokudb_uncompressed
    • tokudb_cleaner_period=1
    • tokudb_cleaner_iterations=10000

So as you can see: amazing results, right? Sustained throughput, immensely better response times and better utilization of available CPU resources. Of course, this is all contrived, with a tree shape that no sane user would implement. It illustrates what happens when the linear list contains many small holes: exactly what we set out to fix!

Closing

Look for this improvement to appear in Percona Server 5.6.32-78.0 and 5.7.14-7. It’s a good one for you if you have huge TokuDB data files with lots and lots of nodes.

Credits!

Throughout this post, I referred to “we” numerous times. That “we” encompasses a great many people that have looked into this in the past and implemented the current solution. Some are current and former Percona and Tokutek employees that you may already know by name. Some are newer at Percona. I got to take their work and research, incorporate it into the current codebase, test and benchmark it, and report it here for all to see. Many thanks go out to Jun Yuan, Leif Walsh, John Esmet, Rich Prohaska, Bradley Kuszmaul, Alexey Stroganov, Laurynas Biveinis, Vlad Lesin, Christian Rober and others for all of the effort in diagnosing this issue, inventing a solution, and testing and reviewing this change to the PerconaFT library.

Categories: MySQL

How Apache Spark makes your slow MySQL queries 10x faster (or more)

MySQL Performance Blog - Wed, 2016-08-17 15:26

In this blog post, we’ll discuss how to improve the performance of slow MySQL queries using Apache Spark.

Introduction

In my previous blog post, I wrote about using Apache Spark with MySQL for data analysis and showed how to transform and analyze a large volume of data (text files) with Apache Spark. Vadim also performed a benchmark comparing the performance of MySQL and Spark with the Parquet columnar format (using Air traffic performance data). That works great, but what if we don’t want to move our data from MySQL to another store (i.e., a columnar format), and instead want to run “ad hoc” queries on top of an existing MySQL server? Apache Spark can help here as well.

TL;DR version:

Using Apache Spark on top of the existing MySQL server(s) (without the need to export or even stream data to Spark or Hadoop), we can increase query performance more than ten times. Using multiple MySQL servers (replication or Percona XtraDB Cluster) gives us an additional performance increase for some queries. You can also use the Spark cache function to cache the whole MySQL query results table.

The idea is simple: Spark can read MySQL data via JDBC and can also execute SQL queries, so we can connect it directly to MySQL and run the queries. Why is this faster? For long-running (i.e., reporting or BI) queries, it can be much faster, as Spark is a massively parallel system. MySQL can use only one CPU core per query, whereas Spark can use all cores on all cluster nodes. In my examples below, MySQL queries executed inside Spark run 5-10 times faster (on top of the same MySQL data).

In addition, Spark can add “cluster” level parallelism. In the case of MySQL replication or Percona XtraDB Cluster, Spark can split the query into a set of smaller queries (for a partitioned table, for example, it runs one query per partition) and run those in parallel across multiple slave servers or multiple Percona XtraDB Cluster nodes. Finally, it uses map/reduce-style processing to aggregate the results.

I’ve used the same “Airlines On-Time Performance” database as in previous posts. Vadim created some scripts to download data and upload it to MySQL. You can find the scripts here: https://github.com/Percona-Lab/ontime-airline-performance. I’ve also used Apache Spark 2.0, which was released July 26, 2016.

Apache Spark Setup

Starting Apache Spark in standalone mode is easy. To recap:

  1. Download Apache Spark 2.0 and place it somewhere.
  2. Start master
  3. Start slave (worker) and attach it to the master
  4. Start the app (in this case spark-shell or spark-sql)

Example:

root@thor:~/spark# ./sbin/start-master.sh
root@thor:~/spark# less ../logs/spark-root-org.apache.spark.deploy.master.Master-1-thor.out
15/08/25 11:21:21 INFO Master: Starting Spark master at spark://thor:7077
15/08/25 11:21:21 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/08/25 11:21:21 INFO MasterWebUI: Started MasterWebUI at http://10.60.23.188:8080
root@thor:~/spark# ./sbin/start-slave.sh spark://thor:7077

To connect to Spark we can use spark-shell (Scala), pyspark (Python) or spark-sql. Since spark-sql is similar to the MySQL CLI, it is the easiest option (even “show tables” works). I also wanted to work with Scala in interactive mode, so I used spark-shell as well. In all the examples I'm using the same SQL query in MySQL and Spark, so working with Spark is not that different.

To work with MySQL server in Spark we need Connector/J for MySQL. Download the package and copy the mysql-connector-java-5.1.39-bin.jar to the spark directory, then add the class path to the conf/spark-defaults.conf:

spark.driver.extraClassPath = /usr/local/spark/mysql-connector-java-5.1.39-bin.jar
spark.executor.extraClassPath = /usr/local/spark/mysql-connector-java-5.1.39-bin.jar

Running MySQL queries via Apache Spark

For this test I used one physical server with 12 CPU cores (older Intel(R) Xeon(R) CPU L5639 @ 2.13GHz), 48GB of RAM and SSD disks. I installed MySQL and started the Spark master and Spark slave on the same box.

Now we are ready to run MySQL queries inside Spark. First, start the shell (from the Spark directory, /usr/local/spark in my case):

$ ./bin/spark-shell --driver-memory 4G --master spark://server1:7077

Then we will need to connect to MySQL from spark and register the temporary view:

val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=",
      "dbtable" -> "ontime.ontime_part",
      "fetchSize" -> "10000",
      "partitionColumn" -> "yeard",
      "lowerBound" -> "1988",
      "upperBound" -> "2016",
      "numPartitions" -> "28"
  )).load()
jdbcDF.createOrReplaceTempView("ontime")

So we have created a “datasource” for Spark (or in other words, a “link” from Spark to MySQL). The Spark table name is “ontime” (linked to the MySQL ontime.ontime_part table), and we can run SQL queries in Spark, which in turn parses them and translates them into MySQL queries.

“partitionColumn” is very important here. It tells Spark to run multiple queries in parallel, one query per partition.
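
Under the hood, Spark turns partitionColumn, lowerBound, upperBound and numPartitions into one WHERE clause per partition. Here is a simplified sketch of that logic (the real code is Spark's JDBCRelation.columnPartition, whose boundary handling is more careful):

// Simplified sketch of how Spark derives per-partition predicates.
// Compare with the actual generated queries shown in the processlist later.
def partitionPredicates(col: String, lower: Long, upper: Long,
                        numPartitions: Int): Seq[String] = {
  val stride = math.max((upper - lower) / numPartitions, 1L)
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = lo + stride
    if (i == 0) s"$col < $hi or $col is null"       // first range is open below
    else if (i == numPartitions - 1) s"$col >= $lo" // last range is open above
    else s"$col >= $lo AND $col < $hi"
  }
}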

Now we can run the query:

val sqlDF = sql("select min(year), max(year) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and (origin = 'RDU' or dest = 'RDU') GROUP by carrier HAVING cnt > 100000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10")
sqlDF.show()

MySQL Query Example

Let’s go back to MySQL for a second and look at the query example. I’ve chosen the following query (from my older blog post):

select min(year), max(year) as max_year, Carrier, count(*) as cnt,
       sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed,
       round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate
FROM ontime
WHERE DayOfWeek not in (6,7)
  and OriginState not in ('AK', 'HI', 'PR', 'VI')
  and DestState not in ('AK', 'HI', 'PR', 'VI')
GROUP by carrier
HAVING cnt > 100000 and max_year > '1990'
ORDER by rate DESC, cnt desc
LIMIT 10

The query finds the total number of delayed flights for each airline. In addition, the query calculates a smart “ontime” rating, taking into consideration the number of flights (we do not want to compare smaller air carriers with the large ones, and we want to exclude the older airlines that are no longer in business).

The main reason I've chosen this query is that it is hard to optimize in MySQL. All conditions in the “where” clause filter out only ~30% of the rows; about 70% remain. I've done a basic calculation:

mysql> select count(*) FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI');
+-----------+
| count(*)  |
+-----------+
| 108776741 |
+-----------+
mysql> select count(*) FROM ontime;
+-----------+
| count(*)  |
+-----------+
| 152657276 |
+-----------+
mysql> select round((108776741/152657276)*100, 2);
+-------------------------------------+
| round((108776741/152657276)*100, 2) |
+-------------------------------------+
|                               71.26 |
+-------------------------------------+

Table structure:

CREATE TABLE `ontime_part` (
  `YearD` int(11) NOT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `MonthD` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date DEFAULT NULL,
  `UniqueCarrier` char(7) DEFAULT NULL,
  `AirlineID` int(11) DEFAULT NULL,
  `Carrier` char(2) DEFAULT NULL,
  `TailNum` varchar(50) DEFAULT NULL,
  ...
  `id` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`,`YearD`),
  KEY `covered` (`DayOfWeek`,`OriginState`,`DestState`,`Carrier`,`YearD`,`ArrDelayMinutes`)
) ENGINE=InnoDB AUTO_INCREMENT=162668935 DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (YearD)
(PARTITION p1987 VALUES LESS THAN (1988) ENGINE = InnoDB,
 PARTITION p1988 VALUES LESS THAN (1989) ENGINE = InnoDB,
 PARTITION p1989 VALUES LESS THAN (1990) ENGINE = InnoDB,
 PARTITION p1990 VALUES LESS THAN (1991) ENGINE = InnoDB,
 PARTITION p1991 VALUES LESS THAN (1992) ENGINE = InnoDB,
 PARTITION p1992 VALUES LESS THAN (1993) ENGINE = InnoDB,
 PARTITION p1993 VALUES LESS THAN (1994) ENGINE = InnoDB,
 PARTITION p1994 VALUES LESS THAN (1995) ENGINE = InnoDB,
 PARTITION p1995 VALUES LESS THAN (1996) ENGINE = InnoDB,
 PARTITION p1996 VALUES LESS THAN (1997) ENGINE = InnoDB,
 PARTITION p1997 VALUES LESS THAN (1998) ENGINE = InnoDB,
 PARTITION p1998 VALUES LESS THAN (1999) ENGINE = InnoDB,
 PARTITION p1999 VALUES LESS THAN (2000) ENGINE = InnoDB,
 PARTITION p2000 VALUES LESS THAN (2001) ENGINE = InnoDB,
 PARTITION p2001 VALUES LESS THAN (2002) ENGINE = InnoDB,
 PARTITION p2002 VALUES LESS THAN (2003) ENGINE = InnoDB,
 PARTITION p2003 VALUES LESS THAN (2004) ENGINE = InnoDB,
 PARTITION p2004 VALUES LESS THAN (2005) ENGINE = InnoDB,
 PARTITION p2005 VALUES LESS THAN (2006) ENGINE = InnoDB,
 PARTITION p2006 VALUES LESS THAN (2007) ENGINE = InnoDB,
 PARTITION p2007 VALUES LESS THAN (2008) ENGINE = InnoDB,
 PARTITION p2008 VALUES LESS THAN (2009) ENGINE = InnoDB,
 PARTITION p2009 VALUES LESS THAN (2010) ENGINE = InnoDB,
 PARTITION p2010 VALUES LESS THAN (2011) ENGINE = InnoDB,
 PARTITION p2011 VALUES LESS THAN (2012) ENGINE = InnoDB,
 PARTITION p2012 VALUES LESS THAN (2013) ENGINE = InnoDB,
 PARTITION p2013 VALUES LESS THAN (2014) ENGINE = InnoDB,
 PARTITION p2014 VALUES LESS THAN (2015) ENGINE = InnoDB,
 PARTITION p2015 VALUES LESS THAN (2016) ENGINE = InnoDB,
 PARTITION p_new VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */

Even with a “covered” index, MySQL will have to scan ~70-100 million rows and create a temporary table:

mysql> explain select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_part WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ontime_part
         type: range
possible_keys: covered
          key: covered
      key_len: 2
          ref: NULL
         rows: 70483364
        Extra: Using where; Using index; Using temporary; Using filesort
1 row in set (0.00 sec)

What is the query response time in MySQL:

mysql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_part WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10;
+------------+----------+---------+----------+-----------------+------+
| min(yearD) | max_year | Carrier | cnt      | flights_delayed | rate |
+------------+----------+---------+----------+-----------------+------+
|       2003 |     2013 | EV      |  2962008 |          464264 | 0.16 |
|       2003 |     2013 | B6      |  1237400 |          187863 | 0.15 |
|       2006 |     2011 | XE      |  1615266 |          230977 | 0.14 |
|       2003 |     2005 | DH      |   501056 |           69833 | 0.14 |
|       2001 |     2013 | MQ      |  4518106 |          605698 | 0.13 |
|       2003 |     2013 | FL      |  1692887 |          212069 | 0.13 |
|       2004 |     2010 | OH      |  1307404 |          175258 | 0.13 |
|       2006 |     2013 | YV      |  1121025 |          143597 | 0.13 |
|       2003 |     2006 | RU      |  1007248 |          126733 | 0.13 |
|       1988 |     2013 | UA      | 10717383 |         1327196 | 0.12 |
+------------+----------+---------+----------+-----------------+------+
10 rows in set (19 min 16.58 sec)

19 minutes is definitely not great.

SQL in Spark

Now we want to run the same query inside Spark and let Spark read data from MySQL. We will create a “datasource” and execute the query:

scala> val jdbcDF = spark.read.format("jdbc").options(
     |   Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=mysql",
     |   "dbtable" -> "ontime.ontime_sm",
     |   "fetchSize" -> "10000",
     |   "partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2015", "numPartitions" -> "48"
     |   )).load()
16/08/02 23:24:12 WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 27; Input number of partitions: 48; Lower bound: 1988; Upper bound: 2015.
jdbcDF: org.apache.spark.sql.DataFrame = [id: int, YearD: date ... 19 more fields]
scala> jdbcDF.createOrReplaceTempView("ontime")
scala> val sqlDF = sql("select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10")
sqlDF: org.apache.spark.sql.DataFrame = [min(yearD): date, max_year: date ... 4 more fields]
scala> sqlDF.show()
+----------+--------+-------+--------+---------------+----+
|min(yearD)|max_year|Carrier|     cnt|flights_delayed|rate|
+----------+--------+-------+--------+---------------+----+
|      2003|    2013|     EV| 2962008|         464264|0.16|
|      2003|    2013|     B6| 1237400|         187863|0.15|
|      2006|    2011|     XE| 1615266|         230977|0.14|
|      2003|    2005|     DH|  501056|          69833|0.14|
|      2001|    2013|     MQ| 4518106|         605698|0.13|
|      2003|    2013|     FL| 1692887|         212069|0.13|
|      2004|    2010|     OH| 1307404|         175258|0.13|
|      2006|    2013|     YV| 1121025|         143597|0.13|
|      2003|    2006|     RU| 1007248|         126733|0.13|
|      1988|    2013|     UA|10717383|        1327196|0.12|
+----------+--------+-------+--------+---------------+----+

spark-shell does not show the query time. This can be retrieved from the Web UI or from spark-sql. I've re-run the same query in spark-sql:

./bin/spark-sql --driver-memory 4G --master spark://thor:7077
spark-sql> CREATE TEMPORARY VIEW ontime
         > USING org.apache.spark.sql.jdbc
         > OPTIONS (
         >   url "jdbc:mysql://localhost:3306/ontime?user=root&password=",
         >   dbtable "ontime.ontime_part",
         >   fetchSize "1000",
         >   partitionColumn "yearD", lowerBound "1988", upperBound "2014", numPartitions "48"
         > );
16/08/04 01:44:27 WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 26; Input number of partitions: 48; Lower bound: 1988; Upper bound: 2014.
Time taken: 3.864 seconds
spark-sql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10;
16/08/04 01:45:13 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
2003 2013 EV 2962008 464264 0.16
2003 2013 B6 1237400 187863 0.15
2006 2011 XE 1615266 230977 0.14
2003 2005 DH 501056 69833 0.14
2001 2013 MQ 4518106 605698 0.13
2003 2013 FL 1692887 212069 0.13
2004 2010 OH 1307404 175258 0.13
2006 2013 YV 1121025 143597 0.13
2003 2006 RU 1007248 126733 0.13
1988 2013 UA 10717383 1327196 0.12
Time taken: 139.628 seconds, Fetched 10 row(s)

So the response time of the same query is almost 10x faster (on the same server, just one box). But how was this query translated into MySQL queries, and why is it so much faster? Here is what is happening inside MySQL:

Inside MySQL

Spark:

scala> sqlDF.show()
[Stage 4:> (0 + 26) / 26]

MySQL:

mysql> select id, info from information_schema.processlist where info is not NULL and info not like '%information_schema%';
| id    | info |
| 10948 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2001 AND yearD < 2002) |
| 10965 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2007 AND yearD < 2008) |
| 10966 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1991 AND yearD < 1992) |
| 10967 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1994 AND yearD < 1995) |
| 10968 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1998 AND yearD < 1999) |
| 10969 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2010 AND yearD < 2011) |
| 10970 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2002 AND yearD < 2003) |
| 10971 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2006 AND yearD < 2007) |
| 10972 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1990 AND yearD < 1991) |
| 10953 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2009 AND yearD < 2010) |
| 10947 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1993 AND yearD < 1994) |
| 10956 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD < 1989 or yearD is null) |
| 10951 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2005 AND yearD < 2006) |
| 10954 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1996 AND yearD < 1997) |
| 10955 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2008 AND yearD < 2009) |
| 10961 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1999 AND yearD < 2000) |
| 10962 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2011 AND yearD < 2012) |
| 10963 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2003 AND yearD < 2004) |
| 10964 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1995 AND yearD < 1996) |
| 10957 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2004 AND yearD < 2005) |
| 10949 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1989 AND yearD < 1990) |
| 10950 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1997 AND yearD < 1998) |
| 10952 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2013) |
| 10958 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1992 AND yearD < 1993) |
| 10960 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2000 AND yearD < 2001) |
| 10959 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2012 AND yearD < 2013) |
26 rows in set (0.00 sec)

Spark is running 26 queries in parallel, which is great. As the table is partitioned it only uses one partition per query, but scans the whole partition:

mysql> explain partitions SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2001 AND yearD < 2002)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ontime_part
   partitions: p2001
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5814106
        Extra: Using where
1 row in set (0.00 sec)

In this case, as the box has 12 CPU cores / 24 threads, it efficiently executes 26 queries in parallel, and the partitioned table helps to avoid contention issues (I wish MySQL could scan partitions in parallel, but it can't at the time of writing).

Another interesting thing is that Spark can “push down” some of the conditions to MySQL, but only those inside the “where” clause. All group by/order by/aggregations are done inside Spark. Spark needs to retrieve the rows that satisfy the “where” conditions from MySQL, and will not push down group by/order by/etc. to MySQL.

That also means that queries without “where” conditions (for example “select count(*) as cnt, carrier from ontime group by carrier order by cnt desc limit 10”) will have to retrieve all data from MySQL and load it into Spark (whereas MySQL would do all the group by work internally). Running such a query in Spark might be slower or faster (depending on the amount of data and use of indexes), but it also requires more resources and potentially more memory dedicated to Spark. The above query is translated into 26 queries, each doing a “select carrier from ontime_part where (yearD >= N AND yearD < N+1)”.

Pushing down the whole query into MySQL 

If we want to avoid sending all data from MySQL to Spark we have the option of creating a temporary table on top of a query (similar to MySQL’s create temporary table as select …). In Scala:

val tableQuery = "(select yeard, count(*) from ontime group by yeard) tmp"
val jdbcDFtmp = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=",
      "dbtable" -> tableQuery,
      "fetchSize" -> "10000"
  )).load()
jdbcDFtmp.createOrReplaceTempView("ontime_tmp")

In Spark SQL:

CREATE TEMPORARY VIEW ontime_tmp
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://localhost:3306/ontime?user=root&password=mysql",
  dbtable "(select yeard, count(*) from ontime_part group by yeard) tmp",
  fetchSize "1000"
);
select * from ontime_tmp;

Please note:

  1. We do not want to use “partitionColumn” here; otherwise we will see 26 queries like this in MySQL: “SELECT yeard, count(*) FROM (select yeard, count(*) from ontime_part group by yeard) tmp where (yearD >= N AND yearD < N+1)” (obviously not optimal)
  2. This is not a good use of Spark; it is more like a “hack.” The only good reason to do it is to be able to use the result of the query as the source of an additional query.

Query cache in Spark

Another option is to cache the result of the query (or even the whole table) and then use .filter in Scala for faster processing. This requires sufficient memory dedicated to Spark. The good news is that we can add additional nodes to Spark and get more memory for the Spark cluster.

Spark SQL example:

CREATE TEMPORARY VIEW ontime_latest
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://localhost:3306/ontime?user=root&password=",
  dbtable "ontime.ontime_part partition (p2013, p2014)",
  fetchSize "1000",
  partitionColumn "yearD", lowerBound "1988", upperBound "2014", numPartitions "26"
);
spark-sql> cache table ontime_latest;
Time taken: 465.076 seconds
spark-sql> select count(*) from ontime_latest;
5349447
Time taken: 0.526 seconds, Fetched 1 row(s)
spark-sql> select count(*), dayofweek from ontime_latest group by dayofweek;
790896 1
634664 6
795540 3
794667 5
808243 4
743282 7
782155 2
Time taken: 0.541 seconds, Fetched 7 row(s)
spark-sql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_latest WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and (origin='RDU' or dest = 'RDU') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10;
2013 2013 MQ 9339 1734 0.19
2013 2013 B6 3302 516 0.16
2013 2013 EV 9225 1331 0.14
2013 2013 UA 1317 177 0.13
2013 2013 AA 5354 620 0.12
2013 2013 9E 5520 593 0.11
2013 2013 WN 10968 1130 0.1
2013 2013 US 5722 549 0.1
2013 2013 DL 6313 478 0.08
2013 2013 FL 2433 205 0.08
Time taken: 2.036 seconds, Fetched 10 row(s)

Here we cache partitions p2013 and p2014 in Spark. This retrieves the data from MySQL and loads it into Spark. After that, all queries run on the cached data and will be much faster.

With Scala we can cache the result of a query and then use filters to only get the information we need:

val sqlDF = sql("SELECT flightdate, origin, dest, depdelayminutes, arrdelayminutes, carrier, TailNum, Cancelled, Diverted, Distance from ontime")
sqlDF.cache().show()
scala> sqlDF.filter("flightdate='1988-01-01'").count()
res5: Long = 862

Using Spark with Percona XtraDB Cluster

As Spark can be used in cluster mode and scale with more and more nodes, reading data from a single MySQL server becomes a bottleneck. We can use MySQL replication slave servers or Percona XtraDB Cluster (PXC) nodes as a Spark datasource. To test this, I provisioned Percona XtraDB Cluster with three nodes on AWS (m4.2xlarge Ubuntu instances) and also started Apache Spark on each node:

  1. Node1 (pxc1): Percona Server + Spark Master + Spark worker node + Spark SQL running
  2. Node2 (pxc2): Percona Server + Spark worker node
  3. Node3 (pxc3): Percona Server + Spark worker node

All the Spark worker nodes use the memory configuration option:

cat conf/spark-env.sh
export SPARK_WORKER_MEMORY=24g

Then I can start spark-sql (the Connector/J JAR file also needs to be copied to all nodes):

$ ./bin/spark-sql --driver-memory 4G --master spark://pxc1:7077

When creating a table, I still use localhost to connect to MySQL (url “jdbc:mysql://localhost:3306/ontime?user=root&password=xxx”). As the Spark worker nodes run on the same instances as the Percona XtraDB Cluster nodes, each uses its local connection. Running a Spark SQL query then evenly distributes all 26 MySQL queries among the three MySQL nodes.

Alternatively, we can run the Spark cluster on a separate host and connect it to HAProxy, which in turn will load balance selects across multiple Percona XtraDB Cluster nodes.

Query Performance Benchmark

Finally, here is the query response time test on the three AWS Percona XtraDB Cluster nodes:

Query 1:

select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt,
       sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed,
       round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate
FROM ontime_part
WHERE DayOfWeek not in (6,7)
  and OriginState not in ('AK', 'HI', 'PR', 'VI')
  and DestState not in ('AK', 'HI', 'PR', 'VI')
GROUP by carrier
HAVING cnt > 1000 and max_year > '1990'
ORDER by rate DESC, cnt desc
LIMIT 10;

Query / Index type             | MySQL Time       | Spark Time (3 nodes) | Times Improvement
No covered index (partitioned) | 19 min 16.58 sec | 192.17 sec           | 6.02
Covered index (partitioned)    | 2 min 10.81 sec  | 48.38 sec            | 2.7

Query 2: select dayofweek, count(*) from ontime_part group by dayofweek;

Query / Index type             | MySQL Time       | Spark Time (3 nodes) | Times Improvement
No covered index (partitioned) | 19 min 15.21 sec | 195.058 sec          | 5.92
Covered index (partitioned)    | 1 min 10.38 sec  | 27.323 sec           | 2.58

Now, this looks really good, but it can be better. With three m4.2xlarge nodes we have 8*3 = 24 cores total (although they are shared between Spark and MySQL). We could expect up to a 10x improvement, especially without a covered index.

However, on m4.2xlarge the amount of RAM was not enough to serve MySQL entirely from memory, so all reads came from EBS non-provisioned IOPS, which only gave me ~120MB/sec. I redid the test on a set of three dedicated servers:

  • 28 cores E5-2683 v3 @ 2.00GHz
  • 240GB of RAM
  • Samsung 850 PRO

The test was running completely off RAM:

Query 1 (from the above)

Query / Index type             | MySQL Time      | Spark Time (3 nodes) | Times Improvement
No covered index (partitioned) | 3 min 13.94 sec | 14.255 sec           | 13.61
Covered index (partitioned)    | 2 min 2.11 sec  | 9.035 sec            | 13.52

Query 2: select dayofweek, count(*) from ontime_part group by dayofweek;

Query / Index type             | MySQL Time      | Spark Time (3 nodes) | Times Improvement
No covered index (partitioned) | 2 min 0.36 sec  | 7.055 sec            | 17.06
Covered index (partitioned)    | 1 min 6.85 sec  | 4.514 sec            | 14.81

With this many cores and running from RAM, we actually do not have enough concurrency, as the table only has 26 partitions. So I also tried the unpartitioned table with the ID primary key, using 128 Spark partitions.

Note about partitioning

I've used a partitioned table (partitioned by year) in my tests to help reduce MySQL-level contention. At the same time, the “partitionColumn” option in Spark does not require that the MySQL table be partitioned. For example, if a table has a primary key, we can use this CREATE VIEW in Spark:

CREATE OR REPLACE TEMPORARY VIEW ontime
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://127.0.0.1:3306/ontime?user=root&password=",
  dbtable "ontime.ontime",
  fetchSize "1000",
  partitionColumn "id", lowerBound "1", upperBound "162668934", numPartitions "128"
);

Assuming we have enough MySQL servers (i.e., nodes or slaves), we can increase the number of partitions and improve the parallelism (as opposed to only 26 partitions when partitioning by year). Actually, the above test gives us an even better response time: 6.44 seconds for query 1.

Where Spark doesn’t work well

For faster queries (those that use indexes or can efficiently use an index) it does not make sense to use Spark. Retrieving data from MySQL and loading it into Spark is not free, and this overhead can be significant for faster queries. For example, a query like select count(*) from ontime_part where YearD = 2013 and DayOfWeek = 7 and OriginState = 'NC' and DestState = 'NC'; will only scan 1300 rows and returns instantly (0.00 seconds reported by MySQL).

An even better example is select max(id) from ontime_part. In MySQL, the query will use the index and all calculations will be done inside MySQL. Spark, on the other hand, will have to retrieve all IDs (select id from ontime_part) from MySQL and calculate the maximum. That took 24.267 seconds.
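
If you do need such a query from Spark, one workaround is to reuse the temporary-table-on-top-of-a-query trick shown earlier and push the aggregate down into MySQL yourself (a sketch; the view and column names are mine):

// Sketch: push the aggregation into MySQL via a subquery-as-table,
// instead of shipping every id over to Spark.
val maxQuery = "(select max(id) as max_id from ontime_part) tmp"
val jdbcMax = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=",
      "dbtable" -> maxQuery)).load()
jdbcMax.createOrReplaceTempView("ontime_max")
sql("select max_id from ontime_max").show() // MySQL computes the max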

Conclusion

Using Apache Spark as an additional engine level on top of MySQL can help to speed up slow reporting queries and adds much-needed scalability for long-running select queries. In addition, Spark can help with query caching for frequent queries.

PS: Visual explain plan with Spark

The Spark Web GUI provides lots of ways of monitoring Spark jobs. For example, it shows “job” progress and visual SQL explain details.

Categories: MySQL

Webinar Thursday 8/18: Preventing and Resolving MySQL Downtime

MySQL Performance Blog - Tue, 2016-08-16 17:14

Join Percona’s Jervin Real for a webinar on Thursday August 18, 2016 at 10 am PDT (UTC-7) on Preventing and Resolving MySQL Downtime.

Preventing MySQL downtime and emergencies is difficult. Often, complex combinations of several things going wrong cause these emergencies. Without knowledge of the causes of emergencies, proactive preventative measures often fail to prevent further problems, no matter how sincere the effort. This talk discusses some of the ways to prevent real production system emergencies, and suggests specific actions for:

  • Application stack configuration
  • MySQL server configuration
  • Operating system configuration
  • Troublesome server features
  • Special features of Percona Server
  • MySQL health checks
  • Percona Toolkit

Register for the webinar here.

 

Jervin Real, Technical Services Manager

As Technical Services Manager, Jervin partners with Percona's customers on building reliable and highly performant MySQL infrastructures, while also doing other fun stuff like watching cat videos on the internet. Jervin joined Percona in April 2010. Starting as a PHP programmer, Jervin quickly got involved with the LAMP stack. He has worked on several high-traffic sites and a number of specialized web applications, such as mobile content distribution. Before joining Percona, Jervin also worked with several hosting companies, providing care for customer-hosted services and data on both Linux and Windows.

Categories: MySQL

Percona Toolkit 2.2.19 is now available

MySQL Performance Blog - Tue, 2016-08-16 15:05

Percona is pleased to announce the availability of Percona Toolkit 2.2.19, released August 16, 2016. Percona Toolkit is a collection of advanced command-line tools that perform a variety of MySQL server and system tasks that DBAs find too difficult or complex to perform manually. Percona Toolkit, like all Percona software, is free and open source.

This release is the current GA (Generally Available) stable release in the 2.2 series. Downloads are available here and from the Percona Software Repositories.

New Features:
  • 1221372: pt-online-schema-change now aborts with an error if the server is a slave, because this can break data consistency in case of row-based replication. If you are sure that the slave will not use row-based replication, you can disable this check using the --force-slave-run option.
  • 1485195: pt-table-checksum now forces replica table character set to UTF-8.
  • 1517155: Introduced --create-table-engine option to pt-heartbeat, which sets a storage engine for the heartbeat table different from the database default engine.
  • 1595678 and 1595912: Introduced --slave-user and --slave-password options to pt-online-schema-change, pt-table-sync, and pt-table-checksum.
  • 1610385: pt-online-schema-change now re-checks the list of slaves in the DSN table. This enables changing the contents of the table while the tool is running.
Bugs Fixed:
  • 1581752: Fixed pt-query-digest date and time parsing from MySQL 5.7 slow query log.
  • 1592166: Fixed memory leak when pt-kill kills a query.
  • 1592608: Fixed overflow of CONCAT_WS when pt-table-checksum or pt-table-sync checksums large BLOB, TEXT, or BINARY columns.
  • 1593265: Fixed pt-archiver deleting rows that were not archived.
  • 1610386: Fixed pt-slave-restart handling of GTID ranges where the left-side integer is larger than 9.
  • 1610387: Removed extra word ‘default’ from the --verbose help for pt-slave-restart.
  • 1610388: Fixed pt-table-sync not quoting enum values properly. They are now recognized as CHAR fields.

Find release details in the release notes and the 2.2.19 milestone at Launchpad. Report bugs on the Percona Toolkit Launchpad bug tracker.

Categories: MySQL

I’m Colin Charles, and I’m here to evangelize open source databases!

MySQL Performance Blog - Tue, 2016-08-16 14:02

Let me introduce myself, I’m Colin Charles.

Percona turns ten years old this year. To me, there is no better time to join the company as the Chief Evangelist in the CTO office.

I’ve been in the MySQL world a tad longer than Percona has, and have had the pleasure of working on MySQL at MySQL AB and Sun Microsystems. Most recently I was one of the founding team members for MariaDB Server in 2009. I watched that grow into the MariaDB Corporation (after the merger with SkySQL) and the MariaDB Foundation.

For me, it’s about the right server for the right job. Today they all support a myriad of different features and different storage engines. Each server has its own community that supports and discusses their pros and cons. This is now true for both the MySQL and MongoDB ecosystems.

I’ve always had a lot of respect for the work Percona does — pragmatic engineering, deeply technical consulting (and blog posts) and amazing conferences. A big deal for me, and a big reason why I’m now here, is that Percona truly believes in the spirit of open source software development. Their obvious support of the open source community is a great pull factor for users as well.

I just spent time on the Percona Live Europe conference committee. (I've been involved in MySQL-related conferences since 2006, and was even Program Chair for a couple of years.) There, I got to see how the conference is evolving beyond just stock MySQL to also include MongoDB and other open source databases.

Recently I visited a customer who was not just interested in using a database, but also in offering a database-as-a-service to their internal customers. I discussed OpenStack with them, and knowing that Percona, the company I now represent, can support the architecture and deployment too? That’s kind of priceless.

We’re all crazy about databases and their position in the overall IT structure. They provide us with cool apps, internet functionality, and all sorts of short cuts to our daily lives. Percona’s role in providing solutions that address the issues that infrastructure faces is what really excites me about my new journey here.

Categories: MySQL

Percona Live Europe 2016 Schedule Now Live

MySQL Performance Blog - Mon, 2016-08-15 17:01

This post reveals the full Percona Live Europe 2016 schedule for Amsterdam this October 3-5.

The official The Percona Live Open Source Database Conference Europe 2016 schedule is now live, and you can find it here.

The schedule demonstrates that this conference has something for everyone! Whether your interest is in MySQL, MongoDB or other open source databases, there are talks that will interest you.

The Percona Live Open Source Database Conference is the premier event for the diverse and active open source database community, as well as businesses that develop and use open source database software. The conferences have a technical focus with an emphasis on the core topics of MySQL, MongoDB, and other open source databases. Tackling subjects such as analytics, architecture and design, security, operations, scalability and performance, Percona Live provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs. This conference is an opportunity to network with peers and technology professionals. It brings together accomplished DBAs, system architects and developers from around the world to share their knowledge and experience – all to help you learn how to tackle your open source database challenges in a whole new way.

The talks are grouped into three areas: MySQL, MongoDB, and open source databases.

Check out the full schedule now!

Advanced Tickets

Purchase your passes now and get the advanced tickets discount. The earlier you buy, the better the value. You can register for Percona Live Europe here.

Sponsor Percona Live

Sponsor the Percona Live Open Source Database Performance Conference Europe 2016. Sponsorship gets you bigger visibility at the most important open source database conference in Europe. Benefits to sponsorship include:

  • Worldwide Audience: Made up of DBAs, developers, CTOs, CEOs, technology evangelists, entrepreneurs, and technology vendors.
  • Perfect Location: In Amsterdam City Centre, walking distance from Amsterdam Central Station.
  • Perfect Event: The showcase event for the rich and diverse MySQL, MongoDB and open source database markets in Europe.

Click here to sponsor now.

Categories: MySQL

Tuning Linux for MongoDB

MySQL Performance Blog - Fri, 2016-08-12 19:36

In this post, we’ll discuss tuning Linux for MongoDB deployments.

By far the most common operating system you’ll see MongoDB running on is Linux 2.6 and 3.x. Linux flavors such as CentOS and Debian do a fantastic job of being a stable, general-purpose operating system. Linux runs software on hardware ranging from tiny computers like the Raspberry Pi up to massive data center servers. To make this flexibility work, however, Linux defaults to some “lowest common denominator” tunings so that the OS will boot on anything.

Working with databases, we often focus on the queries, patterns and tunings that happen inside the database process itself. This means we sometimes forget that the operating system below it is the life-support of the database, the air that it breathes so-to-speak. Of course, a highly-scalable database such as MongoDB runs fine on these general-purpose defaults without complaints, but the efficiency can be equivalent to running in regular shoes instead of sleek runners. At small scale, you might not notice the lost efficiency, but at large scale (especially when data exceeds RAM) improved tunings equate to fewer servers and lower operational costs. For all use cases and scales, good OS tunings also provide some improvement in response times and remove extra “what if…?” questions when troubleshooting.

Overall, memory, network and disk are the system resources important to MongoDB. This article covers how to optimize each of these areas. Of course, while we have successfully deployed these tunings to many live systems, it’s always best to test before applying changes to your servers.

If you plan on applying these changes, I suggest performing them with one full reboot of the host. Some of these changes don’t require a reboot, but test that they get re-applied if you reboot in the future. MongoDB’s clustered nature should make this relatively painless, plus it might be a good time to do that dreaded “yum upgrade” / “aptitude upgrade“, too.

Linux Ulimit

To prevent a single user from impacting the entire system, Linux has a facility to implement some system resource constraints on processes, file handles and other system resources on a per-user basis. For medium-to-high-usage MongoDB deployments, the default limits are almost always too low. Considering MongoDB generally uses dedicated hardware, it makes sense to allow the Linux user running MongoDB (e.g., “mongod”) to use a majority of the available resources.

Now you might be thinking: “Why not disable the limit (or set it to unlimited)?” This is a common recommendation for database servers. I think you should avoid this for two reasons:

  • If you hit a problem, a lack of a limit on system resources can allow a relatively small problem to spiral out of control, often bringing down other services (such as SSH) crucial to solving the original problem.
  • All systems DO have an upper-limit, and understanding those limitations instead of masking them is an important exercise.

In most cases, a limit of 64,000 “max user processes” and 64,000 “open files” (both have defaults of 1024) will suffice. To be more exact, you need to do some math on the number of applications/clients, the maximum size of their connection pools, and some case-by-case tuning for the number of inter-node connections between replica set members and sharding processes. (We might address this in a future blog post.)

You can deploy these limits by adding a file in “/etc/security/limits.d” (or appending to “/etc/security/limits.conf” if there is no “limits.d”). Below is an example file for the Linux user “mongod”, raising open-file and max-user-process limits to 64,000:

mongod       soft        nproc        64000
mongod       hard        nproc        64000
mongod       soft        nofile       64000
mongod       hard        nofile       64000

Note: this change only applies to new shells, meaning you must restart “mongod” or “mongos” to apply this change!
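
A quick way to verify that the limits were picked up by the running process (this sketch assumes a single mongod instance; the output should look similar to the lines below):

$ cat /proc/$(pidof mongod)/limits | egrep "processes|files"
Max processes             64000                64000                processes
Max open files            64000                64000                files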

Virtual Memory Dirty Ratio

The “dirty_ratio” is the percentage of total system memory that can hold dirty pages. The default on most Linux hosts is between 20-30%. When you exceed the limit, the dirty pages are committed to disk, creating a small pause. To avoid this hard pause, there is a second ratio: “dirty_background_ratio” (default 10-15%), which tells the kernel to start flushing dirty pages to disk in the background without any pause.

20-30% is a good general default for “dirty_ratio”, but on large-memory database servers this can be a lot of memory! For example, on a 128GB-memory host this can allow up to 38.4GB of dirty pages. The background ratio won’t kick in until 12.8GB! We recommend that you lower this setting and monitor the impact on query performance and disk IO. The goal is to reduce memory usage without negatively impacting query performance. Reducing cache sizes also guarantees data gets written to disk in smaller, more frequent batches, which improves disk throughput compared to infrequent huge bulk writes.

A recommended setting for dirty ratios on large-memory (64GB+ perhaps) database servers is “vm.dirty_ratio = 15” and “vm.dirty_background_ratio = 5”, or possibly less. (Red Hat recommends lower ratios of 10 and 3 for high-performance/large-memory servers.)

You can set this by adding the following lines to “/etc/sysctl.conf”:

vm.dirty_ratio = 15
vm.dirty_background_ratio = 5

To check these current running values:

$ sysctl -a | egrep "vm.dirty.*_ratio"
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15

Swappiness

“Swappiness” is a Linux kernel setting, ranging from 0-100, that influences how aggressively the Virtual Memory manager swaps memory pages to disk. A setting of 0 tells the kernel to swap only to avoid out-of-memory problems. A setting of 100 tells it to swap aggressively to disk. The Linux default is usually 60, which is not ideal for database usage.

It is common to see a setting of “0” (or sometimes “10”) on database servers, telling the kernel to avoid swapping as much as possible for better response times. However, Ovais Tariq details a known bug (or feature) when using a setting of 0 in this blog post: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/.

Due to this bug, we recommend using a setting of “1” (or “10” if you prefer some disk swapping) by adding the following to your “/etc/sysctl.conf”:

vm.swappiness = 1

To check the current swappiness:

$ sysctl vm.swappiness
vm.swappiness = 1

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply a dirty_ratio or swappiness change!

Transparent HugePages

*Does not apply to Debian/Ubuntu or CentOS/RedHat 5 and lower*

Transparent HugePages is an optimization introduced in CentOS/RedHat 6.0, with the goal of reducing overhead on systems with large amounts of memory. However, due to the way MongoDB uses memory, this feature actually does more harm than good, as memory accesses are rarely contiguous.

Disable THP entirely by adding the following flag to your Linux kernel boot options:

transparent_hugepage=never

Usually this requires changes to the GRUB boot-loader config in the directory “/boot/grub” or “/etc/grub.d” on newer systems. Red Hat covers this in more detail in this article (same method on CentOS): https://access.redhat.com/solutions/46111.
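
To disable THP at runtime without waiting for a reboot, you can also write to sysfs directly as root; note this does not persist across reboots, so keep the kernel boot option above (on some older RedHat kernels the path contains “redhat_transparent_hugepage” instead):

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
cat /sys/kernel/mm/transparent_hugepage/enabled    # expect: always madvise [never]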

Note: We recommend rebooting the system to clear out any previous huge pages and to validate that the setting persists on reboot.

NUMA (Non-Uniform Memory Access) Architecture

Non-Uniform Memory Access is a recent memory architecture that takes into account the locality of caches and CPUs for lower latency. Unfortunately, MongoDB is not “NUMA-aware”, and leaving NUMA in its default configuration can cause severe memory imbalance.

There are two ways to disable NUMA: one is via an on/off switch in the system BIOS config, the second is using the “numactl” command to set NUMA interleaved mode (similar effect to disabling NUMA) when starting MongoDB. Both methods achieve the same result. I lean towards using the “numactl” command, as it future-proofs you for the mostly inevitable addition of NUMA awareness. On CentOS 7+ you may need to install the “numactl” yum/rpm package.

To make mongod start in interleaved mode, add “numactl --interleave=all” before your regular “mongod” command:

$ numactl --interleave=all mongod <options here>

To check mongod’s NUMA setting:

$ sudo numastat -p $(pidof mongod)

Per-node process memory usage (in MBs) for PID 7516 (mongod)
                           Node 0           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                        28.53           28.53
Stack                        0.20            0.20
Private                      7.55            7.55
----------------  --------------- ---------------
Total                       36.29           36.29

If you see only one NUMA-node column (“Node0”), NUMA is disabled. If you see more than one NUMA-node column, make sure the metric numbers (“Heap”, etc.) are balanced between nodes. Otherwise, NUMA is NOT in “interleave” mode.

Note: some MongoDB packages already ship logic to disable NUMA in the init/startup script. Check for this using “grep” first. Your hardware or BIOS manual should cover disabling NUMA via the system BIOS.
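
For example, a quick check of the init script (the script path varies by package and OS, so adjust accordingly):

$ grep -i numa /etc/init.d/mongod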

Block Device IO Scheduler and Read-Ahead

For tuning flexibility, we recommend that MongoDB data sits on its own disk volume, preferably with its own dedicated disks/RAID array. While it may complicate backups, for the best performance you can also dedicate a separate volume for the MongoDB journal to separate its disk activity noise from the main data set. The journal does not yet have its own config/command-line setting, so you’ll need to mount a volume to the “journal” directory inside the dbPath. For example, “/var/lib/mongo/journal” would be the journal mount-path if the dbPath was set to “/var/lib/mongo”.

Aside from good hardware, the block device MongoDB stores its data on can benefit from two major adjustments:

IO Scheduler

The IO scheduler is an algorithm the kernel uses to commit reads and writes to disk. By default, most Linux installs use the CFQ (Completely Fair Queuing) scheduler. This is designed to work well for many general use cases, but it offers few latency guarantees. Two other popular schedulers are “deadline” and “noop”. Deadline excels at latency-sensitive use cases (like databases), and noop is closer to no scheduling at all.

We generally suggest using the “deadline” IO scheduler for cases where you have real, non-virtualized disks under MongoDB. (For example, a “bare metal” server.) In some cases I’ve seen “noop” perform better with certain hardware RAID controllers, however. The difference between “deadline” and “cfq” can be massive for disk-bound deployments.

If you are running MongoDB inside a VM (which has its own IO scheduler beneath it), it is best to use “noop” and let the virtualization layer take care of the IO scheduling itself.

Read-Ahead

Read-ahead is a per-block-device performance tuning in Linux that causes data ahead of a requested block on disk to be read and then cached into the filesystem cache. Read-ahead assumes that there is a sequential read pattern and something will benefit from those extra blocks being cached. MongoDB tends to have very random disk patterns and often does not benefit from the default read-ahead setting, wasting memory that could be used for more hot data. Most Linux systems have a default setting of 128KB/256 sectors (128KB = 256 x 512-byte sectors). This means if MongoDB fetches a 64KB document from disk, 128KB of filesystem cache is used, and maybe the extra 64KB is never accessed later, wasting memory.

For this setting, we suggest a starting-point of 32 sectors (=16KB) for most MongoDB workloads. From there you can test increasing/reducing this setting and then monitor a combination of query performance, cached memory usage and disk read activity to find a better balance. You should aim to use as little cached memory as possible without dropping the query performance or causing significant disk activity.
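
For experimenting before making the change permanent, “blockdev” can set read-ahead at runtime (the change is lost on reboot, and this sketch assumes the data device is /dev/sda):

$ sudo blockdev --setra 32 /dev/sda
$ sudo blockdev --getra /dev/sda
32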

Both the IO scheduler and read-ahead can be changed by adding a file to the udev configuration at “/etc/udev/rules.d”. In this example I am assuming the block device serving mongo data is named “/dev/sda” and I am setting “deadline” as the IO scheduler and 16kb/32-sectors as read-ahead:

# set deadline scheduler and 16kb read-ahead for /dev/sda
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"

To check the IO scheduler was applied ([square-brackets] = enabled scheduler):

$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq

To check the current read-ahead setting:

$ sudo blockdev --getra /dev/sda
32

Note: this change should be applied and tested with a full system reboot!

Filesystem and Options

It is recommended that MongoDB uses only the ext4 or XFS filesystems for on-disk database data. ext3 should be avoided due to its poor pre-allocation performance. If you’re using WiredTiger (MongoDB 3.0+) as a storage engine, it is strongly advised that you ONLY use XFS due to serious stability issues on ext4.

Each time you read a file, the filesystem performs an access-time metadata update by default. However, MongoDB (like most applications) does not use this access-time information. This means you can disable access-time updates on MongoDB’s data volume, eliminating the small amount of disk IO activity these updates cause.

You can disable access-time updates by adding the flag “noatime” to the filesystem options field in the file “/etc/fstab” (4th field) for the disk serving MongoDB data:

/dev/mapper/data-mongodb /var/lib/mongo ext4 defaults,noatime 0 0

Use “grep” to verify the volume is currently mounted:

$ grep "/var/lib/mongo" /proc/mounts
/dev/mapper/data-mongodb /var/lib/mongo ext4 rw,seclabel,noatime,data=ordered 0 0

Note: to apply a filesystem-options change, you must remount (umount + mount) the volume again after stopping MongoDB, or reboot.

Network Stack

Several of the Linux kernel’s default network tunings are either not optimal for MongoDB, limit a typical host with 1000mbps (or faster) network interfaces, or cause unpredictable behavior with routers and load balancers. We suggest some increases to the relatively low throughput settings (net.core.somaxconn and net.ipv4.tcp_max_syn_backlog) and a decrease in keepalive settings, seen below.

Make these changes permanent by adding the following to “/etc/sysctl.conf” (or a new file “/etc/sysctl.d/mongodb-sysctl.conf” if /etc/sysctl.d exists):

net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096

To check the current values of any of these settings:

$ sysctl net.core.somaxconn
net.core.somaxconn = 4096

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply this change!

NTP Daemon

All of these deeper tunings make it easy to forget about something as simple as your clock source. As MongoDB is a clustered system, it relies on consistent time across nodes. Thus the NTP daemon should run permanently on all MongoDB hosts, mongos and arbiters included. Be sure to check that the time syncing won’t fight with any guest-based virtualization tools like “VMWare tools” and “VirtualBox Guest Additions”.

This is installed on RedHat/CentOS with:

$ sudo yum install ntp

And on Debian/Ubuntu:

$ sudo apt-get install ntp

Note: Start and enable the NTP Daemon (for starting on reboots) after installation. The commands to do this vary by OS and OS version, so please consult your documentation.
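
As an example only: on systemd-based distributions (CentOS/RedHat 7+) this usually looks like the lines below; on Debian/Ubuntu the service is named "ntp", and SysV-init systems use "service"/"chkconfig" instead:

$ sudo systemctl enable ntpd
$ sudo systemctl start ntpd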

Security-Enhanced Linux (SELinux)

Security-Enhanced Linux is a kernel-level security access control module that has an unfortunate tendency to be disabled or set to warn-only on Linux deployments. As SELinux is a strict access control system, it can sometimes cause unexpected errors (permission denied, etc.) with applications that were not configured properly for SELinux. Often people disable SELinux to resolve the issue and forget about it entirely. While implementing SELinux is not an end-all solution, it massively reduces the local attack surface of the server. We recommend deploying MongoDB with SELinux “Enforcing” mode on.

The modes of SELinux are:

  1. Enforcing – Block and log policy violations.
  2. Permissive – Log policy violations only.
  3. Disabled – Completely disabled.

As database servers are usually dedicated to one purpose, such as running MongoDB, the work of setting up SELinux is a lot simpler than on a multi-use server with many processes and users (such as an application/web server). The OS access pattern of a database server should be extremely predictable. Introducing “Enforcing” mode at the very beginning of your testing/installation, instead of after-the-fact, avoids a lot of gotchas with SELinux. Logging for SELinux is directed to “/var/log/audit/audit.log” and the configuration is at “/etc/selinux”.

Luckily, Percona Server for MongoDB RPM packages (CentOS/RedHat) are SELinux “Enforcing” mode compatible as they install/enable an SELinux policy at RPM install time! Debian/Ubuntu SELinux support is still in planning.

Here you can see the SELinux policy shipped in the Percona Server for MongoDB version 3.2 server package:

$ rpm -ql Percona-Server-MongoDB-32-server | grep selinux
/etc/selinux/targeted/modules/active/modules/mongod.pp

To change the SELinux mode to “Enforcing”:

$ sudo setenforce Enforcing

To check the running SELinux mode:

$ sudo getenforce
Enforcing
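
Note that setenforce only changes the running mode. To make “Enforcing” persist across reboots, also set it in the SELinux configuration file (the standard CentOS/RedHat layout is assumed here):

# /etc/selinux/config
SELINUX=enforcing
SELINUXTYPE=targeted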

Linux Kernel and Glibc Version

The version of the Linux kernel and Glibc itself may be more important than you think. Some community benchmarks show a significant improvement on OLTP throughput benchmarks with the recent Linux 3.x kernels versus the still widely deployed 2.6 series. To avoid serious bugs, MongoDB should at minimum use Linux 2.6.36 and Glibc 2.13 or newer.
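
Checking the currently running versions is straightforward (output will vary by system):

$ uname -r
3.10.0-327.el7.x86_64
$ ldd --version | head -1
ldd (GNU libc) 2.17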

I hope to create a follow-up post on the specific differences seen under MongoDB workloads on Linux 3.2+ versus 2.6. Until then, I recommend you test the difference using your own workloads and any results/feedback are appreciated.

What’s Next?

What’s the next thing to tune? At this point, tuning becomes case-by-case and open-ended. I appreciate any comments on use-case/tunings pairings that worked for you. Also, look out for follow-ups to this article for a few tunings I excluded due to lack of testing.

Not knowing the next step might mean you’re done tuning, or that you need more visibility into your stack to find the next bottleneck. Good monitoring and data visibility are invaluable for this type of investigation. Look out for future posts regarding monitoring your MongoDB (or MySQL) deployment and consider using Percona Monitoring and Management as an all-in-one monitoring solution. You could also try using Percona-Lab/prometheus_mongodb_exporter, prometheus/node_exporter and Percona-Lab/grafana_mongodb_dashboards for monitoring MongoDB/Linux with Prometheus and Grafana.

The road to an efficient database stack requires patience, analysis and iteration. Tomorrow a new hardware architecture or change in kernel behavior could come along; be the first to spot the next bottleneck! Happy hunting.

Categories: MySQL

Percona XtraDB Cluster 5.7.12 RC1 is now available

MySQL Performance Blog - Thu, 2016-08-11 23:01

Percona announces the first release candidate (RC1) in the Percona XtraDB Cluster 5.7 series on August 9, 2016. Binaries are available from the downloads area or our software repositories.

Percona XtraDB Cluster 5.7.12-5rc1-26.16 is based on the following:

This release includes all changes from upstream releases and the following:

New Features

  • PXC Strict Mode: Use the pxc_strict_mode variable in the configuration file or the --pxc-strict-mode option during mysqld startup.
  • Galera instruments exposed in Performance Schema: This includes mutexes, condition variables, file instances, and threads.

Bug Fixes

  • Fixed error messages.
  • Fixed the failure of SST via mysqldump with gtid_mode=ON.
  • Added check for TOI that ensures node readiness to process DDL+DML before starting the execution.
  • Removed protection against repeated calls of wsrep->pause() on the same node to allow parallel RSU operation.
  • Changed wsrep_row_upd_check_foreign_constraints to ensure that fk-reference-table is open before marking it open.
  • Fixed error when running SHOW STATUS during group state update.
  • Corrected the return code of the sst_flush_tables() function to return a non-negative error code and thus pass the assertion.
  • Fixed memory leak and stale pointer due to stats not freeing when toggling the wsrep_provider variable.
  • Fixed failure of ROLLBACK to register wsrep_handler.
  • Fixed failure of symmetric encryption during SST.

Other Changes

  • Added support for sending the keyring when performing encrypted SST.
  • Changed the code of THD_PROC_INFO to reflect what the thread is currently doing.
  • Using XtraBackup as the SST method now requires Percona XtraBackup 2.4.4 or later.
  • Improved rollback process to ensure that when a transaction is rolled back, any statements open by the transaction are also rolled back.
  • Removed the sst_special_dirs variable.
  • Disabled switching of slave_preserve_commit_order to ON when running PXC in cluster mode, as it conflicts with the existing multi-master commit ordering resolution algorithm in Galera.
  • Based the default my.cnf on Percona Server 5.7 configuration with Galera/wsrep settings from PXC 5.6.
  • Other low-level fixes and improvements for better stability.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Categories: MySQL

Percona Memory Engine for MongoDB

MySQL Performance Blog - Thu, 2016-08-11 17:19

This post discusses Percona Server for MongoDB’s new in-memory storage engine, Percona Memory Engine for MongoDB.

Percona Server for MongoDB introduced the Memory Engine starting with the 3.2.8-2.0 version. To use it, run Percona Server for MongoDB with the --storageEngine=inMemory option.

In-memory is a special configuration of WiredTiger that doesn’t store user data on disk. With this engine, data fully resides in the virtual memory of the system (and might get lost on server shutdown).

Despite the fact that the engine is purely in-memory, it writes a small amount of diagnostic data and statistics to disk. The latter can be controlled with the --inMemoryStatisticsLogDelaySecs option. The --dbpath option controls where to store the files. Generally, in-memory cannot run on the database directory previously used by any other engine (including WiredTiger).

The engine uses the desired amount of memory when configured with the --inMemorySizeGB option. This option takes fractional numbers to allow precise memory size specification. When you reach the specified memory limit, a WT_CACHE_FULL error is returned for all kinds of operations that cause user data size to grow. These include inserting new documents, creating indexes, updating documents by adding or extending fields, running aggregation workflows and others. However, you can still perform read queries on a full engine.
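
Putting these options together, a minimal startup sketch (the path and sizes are placeholders to adjust for your host) might look like:

mongod --storageEngine=inMemory \
       --dbpath=/var/lib/mongo-inmemory \
       --inMemorySizeGB=100.5 \
       --inMemoryStatisticsLogDelaySecs=60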

Since Percona Memory Engine executes fewer operations and makes no disk I/O system calls, it performs better compared to conventional durable storage engines, including WiredTiger’s standard disk-based configuration.

Performance

The following graphs show Percona Memory Engine versus WiredTiger performance. Both engines use the default configuration with a 140GB cache size specified. The hardware is a 56-core Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz with 256GB of RAM and RAID1 2xHDD. The test data set is about the size of the cache and fully fits in memory.

You can clearly see that Percona Memory Engine has better throughput and less jitter on all kinds of workloads. Checkpointing, however, can cause jitter in WiredTiger; these pauses are absent in Percona Memory Engine, as there’s no need to periodically sync in-memory data structures with their on-disk representations.

However, the performance of Percona Memory Engine drops when it’s about to become full (currently, when it’s 99% full). We’ve marked this issue as fixed (https://jira.mongodb.org/browse/SERVER-24580) but it still crops up in extreme cases.

Percona Memory Engine might use up to 1.5 times more memory than the set configuration when it’s close to full. WiredTiger almost never exceeds the specified cache memory limit. This might change in future versions, but current users should guard against possible swapping or OOM-killing of the server if Percona Memory Engine is (mis)configured to use all, or close to all, of the available system RAM.

You can download the latest version of Percona Server for MongoDB, which includes the new Percona Memory Engine feature, here.

Categories: MySQL

Percona Server for MongoDB 3.2.8-2.0 is now available

MySQL Performance Blog - Thu, 2016-08-11 15:48

Percona announces the release of Percona Server for MongoDB 3.2.8-2.0 on August 11, 2016. Download the latest version from the Percona web site or the Percona Software Repositories.

Percona Server for MongoDB 3.2.8-2.0 is an enhanced, open-source, fully compatible, highly scalable, zero-maintenance downtime database supporting the MongoDB v3.2 protocol and drivers. It extends MongoDB with MongoRocks, Percona Memory Engine, and PerconaFT storage engine, as well as enterprise-grade features like external authentication and audit logging at no extra cost. Percona Server for MongoDB requires no changes to MongoDB applications or code.

Note:

We deprecated the PerconaFT storage engine. It will not be available in future releases.

This release is based on MongoDB 3.2.8, and includes the following additional changes:

  • Introducing the new Percona Memory Engine, which is based on a special configuration of WiredTiger that stores data in memory instead of the disk.
  • --auditDestination can now be set to file, syslog, or console.
  • --auditFormat can now be set to JSON or BSON.

    Note

    For more information, see Audit Logging.

  • The MongoRocks engine now supports LZ4 compression. This is an upstream feature of MongoRocks contributed by Percona. To enable it, use the --rocksdbCompression option when running PSMDB with the MongoRocks storage engine. For example:
    ./mongod --dbpath=./data --storageEngine=rocksdb --rocksdbCompression=lz4
    For a high-compression variant of LZ4:
    ./mongod --dbpath=./data --storageEngine=rocksdb --rocksdbCompression=lz4hc

    Note

    If you want to configure this permanently, set the following parameters in the /etc/mongod.conf file:

    storage:
      engine: rocksdb
      rocksdb:
        compression: lz4

The release notes are available in the official documentation.

Categories: MySQL

Introducing Percona Memory Engine for MongoDB

MySQL Performance Blog - Thu, 2016-08-11 15:13

I’m pleased to announce the latest Percona Server for MongoDB feature: Percona Memory Engine for MongoDB.

Everybody understands that memory is much faster than disk – even the fastest solid state storage can’t compete with it. As such, the choice for the most demanding workloads, where performance and predictable latency are paramount, is in-memory computing.

MongoDB is no exception. MongoDB can benefit from a storage engine option that stores data in memory. In fact, MongoDB introduced it in the 3.2 release with their In-Memory Storage Engine. Unfortunately, their engine is only available in their closed source MongoDB Enterprise Edition. Users of their open source MongoDB Community Edition were out of luck. Until now.

At Percona we strive to provide the best open source MongoDB variant software with Percona Server for MongoDB. To meet this goal, we spent the last few months working on an open source implementation of an in-memory storage engine: introducing Percona Memory Engine for MongoDB!

Percona Memory Engine for MongoDB provides the same performance gains as the current implementation of MongoDB’s in-memory engine. Both are based on WiredTiger, but optimize it for cases where data fits in memory and does not need to be persistent.

To make migrating from MongoDB Enterprise Edition to Percona Server for MongoDB as simple as possible, we made our command line and configuration options as compatible as possible with the MongoDB In-Memory Storage Engine.

Look for more blog posts showing the performance advantages of Percona Memory Engine for MongoDB compared to conventional disk-based engines, as well as some use cases and best practices for using Percona Memory Engine in your MongoDB deployments. Below is a quick list of advantages that in-memory processing provides:

  • Reduced costs. Storing data in memory means you do not have to have additional costs for high-performance storage, which provides a great advantage for cloud systems (where high-performance storage comes at a premium).
  • Very high performance reads. In-memory processing provides highly predictable latency as all reads come from memory instead of being pulled from a disk.
  • Very high performance writes. In-memory processing removes the need for persisted data on disk, which is very useful for cases where data durability is not critical.

From a developer standpoint, Percona Memory Engine addresses several practical use cases:

  • Application cache. Replace services such as memcached and custom application-level data structures with the full power of MongoDB features.
  • Sophisticated data manipulation. Augment performance for data manipulation operations such as aggregation and map-reduce.
  • Session management. Decrease application response times by keeping active user sessions in memory.
  • Transient Runtime State. Store application stateful runtime data that doesn’t require on-disk storage.
  • Real-time Analytics. Use in-memory computing in situations where response time is more critical than persistence.
  • Multi-tier object sharing. Facilitate data sharing in multi-tier/multi-language applications.
  • Application Testing. Reduce turnaround time for automated application tests.

I’m including a simple benchmark result for very intensive write workloads that compares Percona Memory Engine and WiredTiger. As you can see, you can get dramatically better performance with Percona Memory Engine!

Download Percona Memory Engine for MongoDB here.

Categories: MySQL

Small innodb_page_size as a performance boost for SSD

MySQL Performance Blog - Thu, 2016-08-11 00:13

In this blog post, we’ll discuss how a small innodb_page_size can create a performance boost for SSD.

In my previous post Testing Samsung storage in tpcc-mysql benchmark of Percona Server I compared different Samsung devices. Most solid state drives (SSDs) use 4KiB as an internal page size, and the InnoDB default page size is 16KiB. I wondered how using a different innodb_page_size might affect the overall performance.

Fortunately, MySQL 5.7 comes with the option innodb_page_size, so you can set InnoDB page sizes other than the standard 16KiB. This option is still quite inconvenient to use, however. You can’t change innodb_page_size for an existing database. Instead, you need to create a brand new database with a different innodb_page_size and reload the whole data set. This is a serious showstopper for production adoption. Specifying innodb_page_size for individual tables or indexes would be a welcome addition, and you could change it with a simple ALTER TABLE foo page_size=4k.
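
As it stands today, the page size is chosen once for the whole instance, in my.cnf, before the data directory is initialized (a minimal sketch; the 4k shorthand is equivalent to 4096):

[mysqld]
# must be set before initializing the datadir; it cannot be changed afterwards
innodb_page_size = 4k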

Anyway, this doesn’t stop us from using innodb_page_size=4k in the testing environment. Let’s see how it affects the results using the same conditions described in my previous post.

Again we see that the PM1725 outperforms the SM863 when memory is limited, and the results are almost equal when we have plenty of memory.

But what about innodb_page_size 4k vs. 16k?

Here is a direct comparison chart:

Tabular results (in NOTPM, more is better):

Buffer Pool, GiB   pm1725_16k   pm1725_4k   sam850_16k   sam850_4k   sam863_16k   sam863_4k   pm1725 4k/16k
  5                  42427.57    73287.07      1931.54     2682.29     14709.69    48841.04   1.73
 15                  78991.67   134466.86      2750.85     6587.72     31655.18    93880.36   1.70
 25                 108077.56   173988.05      5156.72    10817.23     56777.82   133215.30   1.61
 35                 122582.17   195116.80      8986.15    11922.59     93828.48   164281.55   1.59
 45                 127828.82   209513.65     12136.51    20316.91    123979.99   192215.27   1.64
 55                 130724.59   216793.99     19547.81    24476.74    127971.30   212647.97   1.66
 65                 131901.38   224729.32     27653.94    23989.01    131020.07   220569.86   1.70
 75                 133184.70   229089.61     38210.94    23457.18    131410.40   223103.07   1.72
 85                 133058.50   227588.18     39669.90    24400.27    131657.16   227295.54   1.71
 95                 133553.49   226241.41     39519.18    24327.22    132882.29   223963.99   1.69
105                 134021.26   224831.81     39631.03    24273.07    132126.29   222796.25   1.68
115                 134037.09   225632.80     39469.34    24073.36    132683.55   221446.90   1.68

It’s interesting to see that 4k pages help to improve performance by up to 70%, but only for the PM1725 and SM863. For the low-end Samsung 850 Pro, using a 4k innodb_page_size actually makes things worse when using a high amount of memory.

I think a 70% performance gain is too significant to ignore, even if manipulating innodb_page_size requires extra work. It is worthwhile to evaluate whether different innodb_page_size settings help a fast SSD under your workload.

And hopefully MySQL 8.0 makes it easier to use different page sizes!

Categories: MySQL

tpcc-mysql benchmark tool: less random with multi-schema support

MySQL Performance Blog - Tue, 2016-08-09 22:34

In this blog post, I’ll discuss changes I’ve made to the tpcc-mysql benchmark tool. These changes make it less random and support multi-schema.

This post might only be interesting to performance researchers. The tpcc-mysql benchmark tool is what I use to test different hardware (as an example, see my previous post: https://www.percona.com/blog/2016/07/26/testing-samsung-storage-in-tpcc-mysql-benchmark-percona-server/).

The first change is support for multiple schemas, rather than just one schema. Supporting only one schema creates too much internal locking in MySQL on the same rows or the same index. Locking is fine if we want to compare different MySQL server versions, but it limits comparisons of different hardware or Linux kernels, where we want to push MySQL as hard as possible to load the underlying components. One solution is to partition several tables, but since MySQL still does not support Foreign Keys over partitioned tables, we would need to remove the Foreign Keys as well. A better solution is using multiple schemas (which is sort of like artificial partitioning). I’ve implemented this update in the latest code of tpcc-mysql: https://github.com/Percona-Lab/tpcc-mysql.

The second change I proposed is replacing fully random text fields with generated text, something similar to what is used in the TPC-H benchmark. The problem with fully random strings is that they take up a majority of the space in tpcc-mysql schemas, but they aren’t at all compressible. This makes it hard to use tpcc-mysql to compare compression methods in InnoDB (as well as different compression algorithms). This implementation is available in a different branch for now: https://github.com/Percona-Lab/tpcc-mysql/tree/less_random.

If you are using tpcc-mysql, please test these changes.

Categories: MySQL

Webinar Thursday 8/11 at 10 am: InnoDB Troubleshooting

MySQL Performance Blog - Tue, 2016-08-09 20:48

Join Sveta Smirnova Thursday, August 11 at 10 am PDT (UTC-7) for a webinar on InnoDB Troubleshooting.

InnoDB is one of the most popular database engines. This general-purpose storage engine is widely used, has been MySQL’s default engine since version 5.5, and holds the MySQL system tables since 5.7. It is hard to find a MySQL installation that doesn’t have at least one InnoDB table.

InnoDB is not a simple engine. It has its own locks, transactions, log files, monitoring, options and more. It is also under active development. Some of the latest features introduced in 5.6 are read-only transactions and multiple buffer pools (which now can persist on the disk between restarts). In 5.7, InnoDB added spatial indexes and general tablespaces (which can be created to hold table data per user choice). InnoDB development continues forward today.

Its features provide a great deal of power for users, but at the same time make troubleshooting a complex task.

This webinar will try to make InnoDB troubleshooting easier. You will learn about the specific troubleshooting tools InnoDB provides, how and when to use them, how to get useful information from the numerous InnoDB metrics, and how to decode the engine status.

Register for this webinar here.

Sveta Smirnova, Principal Technical Services Engineer

Sveta joined Percona in 2015. Her main professional interests are problem-solving, working with tricky issues and bugs, finding patterns that can quickly solve typical issues, and teaching others how to deal with MySQL issues, bugs and gotchas effectively. Before joining Percona, Sveta worked as a Support Engineer in the MySQL Bugs Analysis Support Group at MySQL AB-Sun-Oracle. She is the author of the book “MySQL Troubleshooting” and of the JSON UDF functions for MySQL.

Categories: MySQL

Docker Images for MySQL Group Replication 5.7.14

MySQL Performance Blog - Mon, 2016-08-08 16:06

In this post, I will point you to Docker images for MySQL Group Replication testing.

There is a new release of the MySQL Group Replication plugin for MySQL 5.7.14. It’s a “beta” plugin, and it is probably the last (or at least one of the final) pre-release packages before Group Replication goes GA (our best guess is during Oracle OpenWorld 2016).

Since it is close to GA, it would be great to get a better understanding of this new technology. Unfortunately, the MySQL Group Replication installation process isn’t very user-friendly.

Or, to put it another way, totally un-user-friendly! It consists of a mere “50 easy steps” – by which I think they mean “easy” to mess up.

Matt Lord, in his post http://mysqlhighavailability.com/mysql-group-replication-a-quick-start-guide/, acknowledges: “getting a working MySQL service consisting of 3 Group Replication members is not an easy “point and click” or automated single command style operation.”

I’m not providing a review of MySQL Group Replication 5.7.14 yet – I need to play around with it a lot more. To make this process easier for myself, and hopefully more helpful to you, I’ve prepared Docker images for the testing of MySQL Group Replication.

Docker Images

To start the first node, run:

docker run -d --net=cluster1 --name=node1 perconalab/mysql-group-replication --group_replication_bootstrap_group=ON

To join each subsequent node:

docker run -d --net=cluster1 --name=node2 perconalab/mysql-group-replication --group_replication_group_seeds='node1:6606'

Of course, you need to have the Docker network created first:

docker network create cluster1
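
Once two or more nodes are running, a quick sanity check (this sketch assumes passwordless root access from within the container, and uses the Performance Schema table shipped with the plugin) is to list the group members:

docker exec -ti node1 mysql -e "SELECT * FROM performance_schema.replication_group_members"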

I hope this will make the testing process easier!

Categories: MySQL

Percona XtraDB Cluster on Ceph

MySQL Performance Blog - Thu, 2016-08-04 22:31

This post discusses how XtraDB Cluster and Ceph are a good match, and how their combination allows for faster SST and a smaller disk footprint.

My last post was an introduction to Red Hat’s Ceph. As interesting and useful as it was, it wasn’t a practical example. Like most of the readers, I learn about and see the possibilities of technologies by burning my fingers on them. This post dives into a real and novel Ceph use case: handling of the Percona XtraDB Cluster SST operation using Ceph snapshots.

If you are familiar with Percona XtraDB Cluster, you know that a full state snapshot transfer (SST) is required to provision a new cluster node. Similarly, SST can also be triggered when a cluster node happens to have a corrupted dataset. Those SST operations consist essentially of a full copy of the dataset sent over the network. The most common SST methods are Xtrabackup and rsync. Both of these methods imply a significant impact and load on the donor while the SST operation is in progress.

For example, the whole dataset will need to be read from the storage and sent over the network, an operation that requires a lot of IO operations and CPU time. Furthermore, with the rsync SST method, the donor is under a read lock for the whole duration of the SST. Consequently, it can take no write operations. Such constraints on SST operations are often the main motivation behind the reluctance to use Percona XtraDB Cluster with large datasets.

So, what could we do to speed up SST? In this post, I will describe a method of performing SST operations when the data is not local to the nodes. You could easily modify the solution I am proposing for any non-local data source technology that supports snapshots/clones, and has an accessible management API. Off the top of my head (other than Ceph) I see AWS EBS and many SAN-based storage solutions as good fits.

The challenges of clone-based SST

If we could use snapshots and clones, what would be the logical steps for an SST? Let’s have a look at the following list:

  1. New node starts (joiner) and unmounts its current MySQL datadir
  2. The joiner then asks for an SST
  3. The donor creates a consistent snapshot of its MySQL datadir with the Galera position
  4. The donor sends to the joiner the name of the snapshot to use
  5. The joiner creates a clone of the snapshot name provided by the donor
  6. The joiner mounts the snapshot clone as the MySQL datadir and adjusts ownership
  7. The joiner initializes MySQL on the mounted clone

As we can see, all these steps are fairly simple, but they hide some challenges for an SST method based on cloning. The first challenge is the need to mount the snapshot clone. Mounting a block device requires root privileges – and SST scripts normally run under the MySQL user. The second challenge I encountered wasn’t expected. MySQL opens the datadir and some files in it before the SST happens. Consequently, those files are then kept open in the underlying mount point, a situation that is far from ideal. Fortunately, there are solutions to both of these challenges, as we will see below.

SST script

So, let’s start with the SST script. The script is available in my Github at:

https://github.com/y-trudeau/ceph-related-tools/raw/master/wsrep-sst/wsrep_sst_ceph

You should install the script in the /usr/bin directory, along with the other user scripts. Once installed, I recommend:

chown root.root /usr/bin/wsrep_sst_ceph
chmod 755 /usr/bin/wsrep_sst_ceph

The script has a few parameters that can be defined in the [sst] section of the my.cnf file.

cephlocalpool
The Ceph pool where this node should create the clone. It can be a different pool from the one of the original dataset. For example, it could have a replication factor of 1 (no replication) for a read scaling node. The default value is: mysqlpool
cephmountpoint
What mount point to use. It defaults to the MySQL datadir as provided to the SST script.
cephmountoptions
The options used to mount the filesystem. The default value is: rw,noatime
cephkeyring
The Ceph keyring file to authenticate against the Ceph cluster with cephx. The user under which MySQL is running must be able to read the file. The default value is: /etc/ceph/ceph.client.admin.keyring
cephcleanup
Whether or not the script should clean up the snapshots and clones that are no longer used. Enable = 1, Disable = 0. The default value is: 0
Root privileges

In order to allow the SST script to perform privileged operations, I added an extra SST role: “mount”. The SST script on the joiner will call itself back with sudo and will pass “mount” for the role parameter. To allow the elevation of privileges, the following line must be added to the /etc/sudoers file:

mysql ALL=NOPASSWD: /usr/bin/wsrep_sst_ceph

Files opened by MySQL before the SST

Upon startup, MySQL opens files at two places in the code before the SST completes. The first one is in the function mysqld_main, which sets the current working directory to the datadir (an empty directory at that point). After the SST, a block device is mounted on the datadir. The issue is that MySQL tries to find the files in the empty mount point directory. I wrote a simple patch, presented below, and issued a pull request:

diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index 90760ba..bd9fa38 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -5362,6 +5362,13 @@ a file name for --log-bin-index option", opt_binlog_index_name);
       }
     }
   }
+
+  /*
+   * Forcing a new setwd in case the SST mounted the datadir
+   */
+  if (my_setwd(mysql_real_data_home,MYF(MY_WME)) && !opt_help)
+    unireg_abort(1);        /* purecov: inspected */
+
   if (opt_bin_log)
   {
     /*

With this patch, I added a new my_setwd call right after the SST completed. The Percona engineering team approved the patch, and it should be added to the upcoming release of Percona XtraDB Cluster.

The Galera library is the other source of opened files before the SST. Here, the fix is just in the configuration. You must define the base_dir Galera provider option outside of the datadir. For example, if you use /var/lib/mysql as datadir and cephmountpoint, then you should use:

wsrep_provider_options="base_dir=/var/lib/galera"

Of course, if you have other provider options, don’t forget to add them there.

Walkthrough

So, what are the steps required to use Ceph with Percona XtraDB Cluster? (I assume that you have a working Ceph cluster.)

1. Join the Ceph cluster

The first thing you need is a working Ceph cluster with the needed CephX credentials. While the setup of a Ceph cluster is beyond the scope of this post, we will address it in a subsequent post. For now, we’ll focus on the client side.

You need to install the Ceph client packages on each node. On my test servers using Ubuntu 14.04, I did:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
sudo apt-add-repository 'deb http://download.ceph.com/debian-infernalis/ trusty main'
apt-get update
apt-get install ceph

These commands also installed all the dependencies. Next, I copied the Ceph cluster configuration file /etc/ceph/ceph.conf:

[global]
fsid = 87671417-61e4-442b-8511-12659278700f
mon_initial_members = odroid1, odroid2
mon_host = 10.2.2.100, 10.2.2.20, 10.2.2.21
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_journal = /var/lib/ceph/osd/journal
osd_journal_size = 128
osd_pool_default_size = 2

and the authentication file /etc/ceph/ceph.client.admin.keyring from another node. I made sure these files were readable by all. You can define more refined privileges for a production system with CephX, the security layer of Ceph.

Once everything is in place, you can test if it is working with this command:

root@PXC3:~# ceph -s
    cluster 87671417-61e4-442b-8511-12659278700f
     health HEALTH_OK
     monmap e2: 3 mons at {odroid1=10.2.2.20:6789/0,odroid2=10.2.2.21:6789/0,serveur-famille=10.2.2.100:6789/0}
            election epoch 474, quorum 0,1,2 odroid1,odroid2,serveur-famille
     mdsmap e204: 1/1/1 up {0=odroid3=up:active}
     osdmap e995: 4 osds: 4 up, 4 in
      pgmap v275501: 1352 pgs, 5 pools, 321 GB data, 165 kobjects
            643 GB used, 6318 GB / 7334 GB avail
                1352 active+clean
  client io 16491 B/s rd, 2425 B/s wr, 1 op/s

Which gives the current state of the Ceph cluster.

2. Create the Ceph pool

Before we can use Ceph, we need to create a first RBD image, put a filesystem on it and mount it for MySQL on the bootstrap node. We need at least one Ceph pool since the RBD images are stored in a Ceph pool.  We create a Ceph pool with the command:

ceph osd pool create mysqlpool 512 512 replicated

Here, we have defined the pool mysqlpool with 512 placement groups. On a larger Ceph cluster, you might need to use more placement groups (again, a topic beyond the scope of this post). The pool we just created is replicated. Each object in the pool will have two copies as defined by the osd_pool_default_size parameter in the ceph.conf file. If needed, you can modify the size of a pool and its replication factor at any moment after the pool is created.
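
For example, raising the replication factor of the pool later is a single command (a sketch using the standard Ceph CLI):

# keep 3 copies of each object in mysqlpool
ceph osd pool set mysqlpool size 3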

3. Create the first RBD image

Now that we have a pool, we can create a first RBD image:

root@PXC1:~# rbd -p mysqlpool create PXC --size 10240 --image-format 2

and “map” the RBD image to a host block device:

root@PXC1:~# rbd -p mysqlpool map PXC
/dev/rbd1

The command returns the local RBD block device that corresponds to the RBD image.

The rest of the steps are not specific to RBD images. We need to create a filesystem and prepare the mount points:

mkfs.xfs /dev/rbd1
mount /dev/rbd1 /var/lib/mysql -o rw,noatime,nouuid
chown mysql.mysql /var/lib/mysql
mysql_install_db --datadir=/var/lib/mysql --user=mysql
mkdir /var/lib/galera
chown mysql.mysql /var/lib/galera

You need to mount the RBD device and run the mysql_install_db tool only on the bootstrap node. You need to create the directories /var/lib/mysql and /var/lib/galera on the other nodes and adjust the permissions similarly.

4. Modify the my.cnf files

You will need to set or adjust the specific wsrep_sst_ceph settings in the my.cnf file of all the servers. Here are the relevant lines from the my.cnf file of one of my cluster nodes:

[mysqld]
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options="base_dir=/var/lib/galera"
wsrep_cluster_address=gcomm://10.0.5.120,10.0.5.47,10.0.5.48
wsrep_node_address=10.0.5.48
wsrep_sst_method=ceph
wsrep_cluster_name=ceph_cluster

[sst]
cephlocalpool=mysqlpool
cephmountoptions=rw,noatime,nodiratime,nouuid
cephkeyring=/etc/ceph/ceph.client.admin.keyring
cephcleanup=1

At this point, we can bootstrap the cluster on the node where we mounted the initial RBD image:

/etc/init.d/mysql bootstrap-pxc

5. Start the other XtraDB Cluster nodes

The first node does not perform an SST, so nothing exciting so far. With the patched version of MySQL (the above patch), starting MySQL on a second node triggers a Ceph SST operation. In my test environment, the SST takes about five seconds to complete on low-powered VMs. Interestingly, the duration is not directly related to the dataset size. Because of this, a much larger dataset, on a quiet database, should take about the same time. A very busy database may need more time, since an SST requires a “flush tables with read lock” at some point.

So, after their respective Ceph SST, the other two nodes have:

root@PXC2:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC2:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC2-1463776424 -    /dev/rbd1
root@PXC3:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC3:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC3-1464118729 -    /dev/rbd1

The original RBD image now has two snapshots that are mapped to the clones mounted by other two nodes:

root@PXC3:~# rbd -p mysqlpool ls
PXC
PXC2-1463776424
PXC3-1464118729
root@PXC3:~# rbd -p mysqlpool info PXC2-1463776424
rbd image 'PXC2-1463776424':
        size 10240 MB in 2560 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.108b4246146651
        format: 2
        features: layering
        flags:
        parent: mysqlpool/PXC@1463776423
        overlap: 10240 MB

Discussion

Apart from allowing faster SST, what other benefits do we get from using Ceph with Percona XtraDB Cluster?

The first benefit is that the inherent data duplication over the network removes the need for local data replication. Thus, instead of using raid-10 or raid-5 with an array of disks, we could use a simple raid-0 stripe set if the data is already replicated to more than one server.

The second benefit is a bit less obvious: you don’t need as much storage. Why? A Ceph clone only stores the delta from its original snapshot. So, for large, read intensive datasets, the disk space savings can be very significant. Of course, over time, the clone will drift away from its parent snapshot and will use more and more space. When we determine that a Ceph clone uses too much disk space, we can simply refresh the clone by restarting MySQL and forcing a full SST. The SST script will automatically drop the old clone and snapshot when the cephcleanup option is set, and it will create a new fresh clone. You can easily evaluate how much space is consumed by the clone using the following commands:

root@PXC2:~# rbd -p mysqlpool du PXC2-1463776424
warning: fast-diff map is not enabled for PXC2-1463776424. operation may be slow.
NAME            PROVISIONED  USED
PXC2-1463776424      10240M  164M

Also, nothing prevents you from using a different configuration of Ceph pools in the same XtraDB Cluster. A Ceph clone can therefore use a different pool than its parent snapshot. That’s the whole purpose of the cephlocalpool parameter. Strictly speaking, you only need one node to use a replicated pool, as the other nodes could run on clones that store their data in a non-replicated pool (saving a lot of storage space). Furthermore, we can define the OSD affinity of the non-replicated pool in a way that it stores data on the host where it is used, reducing the cross-node network latency.

Using Ceph for XtraDB Cluster SST operation demonstrates one of the array of possibilities offered to MySQL by Ceph. We continue to work with the Red Hat team and Red Hat Ceph Storage architects to find new and useful ways of addressing database issues in the Ceph environment. There are many more posts to come, so stay tuned!

DISCLAIMER: The wsrep_sst_ceph script isn’t officially supported by Percona.
Categories: MySQL

Testing Docker multi-host network performance

MySQL Performance Blog - Wed, 2016-08-03 19:26

In this post, I’ll review Docker multi-host network performance.

In a past post, I tested Docker network performance. The MySQL Server team provided their own results, which are in line with my observations.

For this set of tests, I wanted to focus more on Docker networking using multiple hosts. Mostly because when we set up a high availability (HA) environment (using Percona XtraDB Cluster, for example) the expectation is that instances are running on different hosts.

Another reason for this test is that Docker recently announced the 1.12 release, which supports Swarm Mode. Swarm Mode is quite interesting by itself — with this release, Docker is targeting deeper orchestration deployments in order to compete with Kubernetes and Apache Mesos. I would say Swarm Mode is still rough around the edges (expected for a first release), but I am sure Docker will polish this feature in the next few releases.

Swarm Mode also expects that you run services on different physical hosts, with services communicating over the Docker network. I wanted to see how much of a performance hit we take when we run over a Docker network on multiple hosts.

Network performance is especially important for clustering setups like Percona XtraDB Cluster and  MySQL Group Replication (which just put out another Lab release).

For my setup, I used two physical servers connected over a 10GB network. Both servers use Intel CPUs with 56 cores total.

Sysbench setup: data fits into memory, and I will only use primary key lookups. Testing over the network gives the worst-case scenario for network round trips, but it also gives good visibility into performance impacts.
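
For reference, a sketch of the kind of point-select workload used (shown with sysbench 1.0 syntax; the host, table count and sizes are placeholders, not the exact command from these runs):

sysbench oltp_point_select \
  --mysql-host=10.10.1.2 --mysql-user=sbtest --mysql-password=sbtest \
  --tables=8 --table-size=1000000 \
  --threads=64 --time=300 run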

The following are options for Docker network:

  • No Docker containers (marked as “direct” in the following results)
  • Docker container uses “host” network (marked as “host”)
  • Docker container uses “bridge” network, where service port exposed via port forwarding (marked as “bridge”)
  • Docker container uses “overlay” network, where both client and server are started in containers connected via the overlay network (marked as “overlay” in the results). For the “overlay” network it is possible to use third-party plugins with different implementations of the network; the best known are Calico and Weave, both tested below.

For multi-host networking setup, only “overlay” (and plugins implementations) are feasible. I used “direct”, “host” and “bridge” only for the reference and as a comparison to measure the overhead of overlay implementations.
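
For clarity, here is roughly how each mode is selected when starting a container (the image name and port are placeholders; creating an overlay network requires Swarm mode or an external key-value store):

# "host" network: the container shares the host network stack
docker run -d --net=host mysql-image

# "bridge" network: expose the MySQL port via port forwarding
docker run -d -p 3306:3306 mysql-image

# "overlay" network: create the multi-host network first, then attach to it
docker network create --driver overlay dbnet
docker run -d --net=dbnet mysql-image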

The results I observed are:

Client            Server            Throughput, tps   Ratio to “direct-direct”
Direct            Direct            282780            1.0
Direct            Host              280622            0.99
Direct            Bridge            250104            0.88
Bridge            Bridge            235052            0.83
overlay           overlay           120503            0.43
Calico overlay    Calico overlay    246202            0.87
Weave overlay     Weave overlay     11554             0.044

Observations
  • The “bridge” network added about 12% overhead, which is in line with my previous benchmark. I wonder, however, whether this is Docker overhead or just the Linux implementation of bridge networks. Docker should be using the setup that I described in Running Percona XtraDB Cluster nodes with Linux Network namespaces on the same host, and I suspect that the Linux network namespaces and bridges themselves add the overhead. I need to do more testing to verify.
  • The native “overlay” Docker network suffered from performance problems. I observed ksoftirqd using 100% of one CPU core, and I have seen similar reports. It seems that network interrupts in the Docker “overlay” network are not distributed properly across multiple CPUs (a quick way to check this is sketched after this list). This is not the case with the “direct” and “bridge” configurations. I believe this is a problem with the Docker “overlay” network implementation (hopefully, it will eventually be fixed).
  • The Weave network showed absolutely terrible results. I saw a lot of CPU time allocated to the “weave” containers, so I think there are serious scalability issues in their implementation.
  • The Calico plugin showed the best result for multi-host containers, even better than the “bridge-bridge” network setup.
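To check whether softirq processing is pinned to a single core, as observed with the native overlay network, something like the following works (a generic Linux diagnostic, not specific to Docker):

    # Look for a single CPU stuck near 100% in the %soft column.
    mpstat -P ALL 1
    # Watch how NET_RX softirqs are distributed across CPUs.
    watch -n1 'grep NET_RX /proc/softirqs'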
Conclusion

If you need to use the Docker “overlay” network (a requirement if you are looking to deploy a multi-host environment or use Docker Swarm mode), I recommend you consider the Calico network plugin for Docker. The native Docker “overlay” network can be used for prototypes or quick tests, but at the moment it shows performance problems on high-end hardware.
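For what it’s worth, a Calico-backed Docker network in this era was created through Calico’s libnetwork plugin, roughly as follows. The driver and IPAM names reflect my understanding of the plugin and should be verified against the Calico documentation:

    # Assumes the calico/node agent is already running on both hosts.
    docker network create -d calico --ipam-driver calico-ipam sbnet-calico
    docker run --net=sbnet-calico some-image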

Categories: MySQL

Take Percona’s One-Click Database Security Downtime Poll

MySQL Performance Blog - Tue, 2016-08-02 22:12

Take Percona’s database security downtime poll.

As Peter Zaitsev mentioned recently in his blog post on database support, data breach costs can hit both your business reputation and your bottom line. Costs vary depending on company size and market, but recent studies estimate direct costs averaging from $1.6M to $7.01M. Everyone agrees that leaving rising security risks and costs unchecked is a recipe for disaster.

There is no simple answer for reducing security-based outages, but a solution can combine internal and external monitoring, support contracts, enhanced security systems, and a better understanding of security configuration settings.

Please take a few seconds and answer the following poll. It will help the community get an idea of how security breaches can impact their critical database environments.

If you’ve faced specific issues, feel free to comment below. We’ll post a follow-up blog with the results!

Note: there is a poll embedded within this post; please visit the site to participate.

You can see the results of our last blog poll on high availability here.

Categories: MySQL

High Availability Poll Results

MySQL Performance Blog - Tue, 2016-08-02 22:11

This blog reports the results of Percona’s high availability poll.

High availability (HA) is always a hot topic. The reality is that if your data is not available, your customers cannot do business with you. In fact, estimates show the average cost of downtime is about $5K per minute. With an average outage taking 40 minutes to correct, you could be looking at a potential cost of $200K if your MySQL instance goes down. Whether your database is on premises or in a public or private cloud, it is critical that your database deployment does not have a potentially devastating single point of failure.

The results from Percona’s high availability poll responses are in:

Note: there is a poll embedded within this post; please visit the site to participate.

With over 700 unique participants and 844 total selections (respondents could choose more than one option), MySQL replication was the clear frontrunner when it comes to high availability solutions.

Percona has HA solutions available; come find out more on our website.

If you’re using other solutions or have specific issues, feel free to comment below.

Check out the latest Percona one-click poll on database security here.

Categories: MySQL