MySQL/Aggregation: Comparing COUNT(DISTINCT) values on big tables

Sometimes one needs to work with rather big database tables and with COUNT(DISTINCT) statements. Recently, I had to analyze the result sets of COUNT(DISTINCT) statements for table consistency checks. These tasks required subselects or subqueries. The SQL statements were performed on a test table with more than 5 million records. The data records – together with data of another table – formed the data basis of several complicated computational runs. The eventual record number in our application will be much larger, but some optimization principles can already be learned from the 5 million record example.

Most of the computed COUNT(DISTINCT) values referred to some of the table’s key columns. Our data records have a combination of three different key fields, which together define a unique key. The combined unique key has a sub-structure corresponding to physical aspects of the real world data objects behind the records. This induces a natural grouping of records.

Such a data structure occurs very often in physical simulations: Think e.g. of a grid of meteorological observation stations that measure environment data at distinct points of a time line. Each station would then be characterized by two positional coordinates. The measured quantity values would have a threefold unique key: two geometrical position values and a time value. The data for temperature and pressure over time in a database table could then be aggregated and grouped according to geometrical positions, i.e. for the stations and their locations.

In my case COUNT(DISTINCT) values had to be computed over one key column and for groups defined by distinct combinations of other keys. The results should be compared with a reference value: the COUNT (DISTINCT) result for the whole table. I shall explain the detailed key structure and the comparison objectives below. Some of the counted distinct values also had to be compared with similar values of another, but much smaller table that contained only around 14000 records for a reduced set of the key columns. The comparisons were part of consistency checks concerning record groups inside the huge table and concerning the equality of distinct key values used in the huge table and the smaller table.

When working on these tasks I first tried to optimize the required sub-select statements for the COUNT(DISTINCT) evaluation on my key columns by defining and using proper indices. After some more analysis, I extended my optimization strategy to creating a helper table instead of working with intermediate but volatile “derived result sets” of my sub-selects. Read below, why …

My case is a little special, but consistency checks regarding the number of records in certain record groups which correspond to distinguished combinations of some keys occur in very many environments (physical simulation, statistics, neuronal networks, ..). Therefore, I hope that the insights I got from my simple scenario may be helpful for other people who have to deal with COUNT(DISTINCT) statements on big tables, too.

The playground, data key structure and some numbers

In our project we load the data of a sequence of csv-files (each with millions of records) into 2 tables of a database. After each import process we check some consistency conditions which the records of the two tables must fulfill. The reason is simply that we want to avoid interrupted or crashing computation runs due to a lack of data or inconsistent data provided by our tables.

Each record in our big table contains data values that depend on a threefold key structure:

Two keys [m,n] describe certain objects by 2 dimensions in an abstract space. Let us call such an object a “node”. For a “node” some hundred values of several quantities are defined with respect to an additional integer key “[i]” of a further dimension.

The table structure of our first table “TA” is flat – something like

nr,    m, n, i    Q1, Q2, Q3, ….

with the Qs representing some quantities and the first column being an auto-incremented one. The number distribution of the key values in our case is as follows:

m has only some few distinct values (2 to 10), whereas n may have distinct values in a range between 10000 up to 100000. Distinct i values are in the range between 200 and 1000. In my concrete test table I chose:

  • distinct m values : 1 (as an extreme situation)
  • distinct n values : 13580
  • distinct i values : 386

There is another table “TB” with keys [m,n] and associated quantity values per object. The data of both tables are later on used to perform several calculations. For a sufficient performance of the computational runs we need at least a combined index over the tuple [m,n] – which should already be created during the data import phase. The funny thing in our scenario is that the time required for the data load and import phase dominates all of our computation runs – right now by a factor of 1.5 to 3. So, we really have no intention of making the import time interval any bigger without very good reasons.
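The table layout described so far can be sketched as follows. This is a miniature SQLite stand-in for the MySQL/MyISAM setup (the table and column names follow the text; the tiny key ranges and quantity values are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE TA (
    nr INTEGER PRIMARY KEY AUTOINCREMENT,   -- auto-incremented first column
    m INTEGER, n INTEGER, i INTEGER,
    Q1 REAL, Q2 REAL)""")
con.execute("CREATE TABLE TB (m INTEGER, n INTEGER, QB REAL)")
# the combined (non-unique) [n,m,i]-index used throughout the article
con.execute("CREATE INDEX idx_nmi ON TA (n, m, i)")

# tiny stand-in for the csv import: 1 m-value, 4 nodes, 3 i-values per node
for n in range(4):
    con.execute("INSERT INTO TB (m, n, QB) VALUES (1, ?, 0.0)", (n,))
    for i in range(3):
        con.execute("INSERT INTO TA (m, n, i, Q1, Q2) VALUES (1, ?, ?, 0.0, 0.0)",
                    (n, i))
con.commit()
print(con.execute("SELECT COUNT(*) FROM TA").fetchone()[0])   # 12
```

All later snippets assume a structure of this kind: records grouped into [m,n]-nodes, each carrying a set of i-dependent values.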

Objectives of COUNT and COUNT(DISTINCT) statements: consistency checks

The consistency of the imported data in the two tables TA and TB must be guaranteed. Some of the consistency conditions that must be fulfilled are:

  1. The number of distinct values for [m], [n] and [m,n] must be the same in the tables TA and TB
  2. The number of unique [m,n,i] values should be equal to the total number of entries in table TA.
  3. The number of distinct [i] values (i.e. number of distinct records) for any given [m,n]-key pair (i.e. node) should be equal to the total number of records for that [m,n]-pair.
  4. The number of distinct i values (i.e. number of records) of a given [m,n]-key pair in table TA should have an identical [m,n]-independent value. (For all nodes there must be an identical number of i-dependent records).
  5. The number of distinct i-key values of any [m,n] pair should be identical to the number of distinct i-values given for the full table

The last 3 conditions together (!) guarantee that for each [m,n] combination (=node) the different i values of the associated records

  • are all distinct – with the same sequence of values independent of [m,n];
  • lead to the same number of records for every [m,n]-combination.

The following aspects should be considered:

  • The checks are to be performed several times as the big table is generated stepwise by a sequence of “LOAD DATA INFILE” processes – e.g. 4 times a 5 million record file. A consistency check should in addition always be performed before a computation run is started.
  • The time required for consistency checks adds to the total data import and preparation time for the tables. The effect should be limited to a tolerable factor below 1.3.
  • We expect indices over the key columns to play an important role for the performance of SELECTS for aggregated distinct key values. Building indices, however, may cost substantial additional time during the data import and preparation phase.
  • The performance of certain SELECT statements may depend on the order of columns used in the index definition.

As the data import and preparation time was already dominant we were eager to avoid any intolerable prolongation of this time period.

The impact of a unique or non-unique index over all key columns on the data import and preparation time

In our case we would create a [m,n] or [n,m] index. Such an index was unavoidable due to performance requirements for the computation runs. An [n,m]-index would give us a slight performance advantage during our calculations compared to an [m,n]-index. But this difference is marginal. In the data loading phase we import our data into the MySQL base with a “LOAD DATA INFILE” statement issued by a PHP program. The index is filled in parallel. We were interested in whether we could afford a full unique index over all 3 key columns.

Below we give some numbers for the required time intervals of our data imports. These numbers always include a common overhead part. This overhead time is due to an Ajax exchange between browser and server, zip file transfer time across our network, PHP file loading times, zip expansion time on the server, movement of files, etc. So the numbers given below do not reflect the pure data loading time into the base on the server; by estimation that time is at least 4.5 seconds smaller. The test server itself is a relatively small one: a KVM-virtualized LAMP server with 2 GByte of assigned RAM, an Intel Q9550 CPU and a Raid 10 disk array. For the following tests I worked with MySQL MyISAM tables – however, the general picture should hold for InnoDB tables, too.

Addendum, 21.09.2014:
Please note in addition that our big table has a first column which is auto-incremented and is used as a primary index. This primary index was always created. It explains why the absolute numbers of data import times given below are relatively big. See a forthcoming article
MySQL: LOAD DATA INFILE, csv-files and index creation for big data tables
for a more complete discussion of the impact of (unique and non-unique) indices on data imports from csv files.

We got the following loading times required to fill table TA with data from a csv-file with m=1, n=13580, i=386 (all distinct) by using “LOAD DATA INFILE”:

  • without [n,m,i]-index creation : 23.5 sec.
  • with [n,m]-index creation (non unique) : 25.3 sec
  • with [n,m,i]-index creation (non unique) : 26.3 sec
  • with [n,m,i]-index creation (unique) : 36.2 sec

(compare with the following blog articles
MySQL/PHP: LOAD DATA – import of large csv files – linearity with record number and
Importing large csv files with PHP into a MySQL MyISAM table )

Note that there is no big difference between creating a full [n,m,i]-index over all key columns and creating an index only for the first two columns of our test table. So a full [n,m,i]-index creation seemed to be affordable. It would not hamper the computation runs, and it helps a little with our consistency checks.

However, there is a substantial difference of loading times for a unique and a non-unique index. This difference stems of course from the related uniqueness checks which have to be performed during the implicit INSERTs of the data into the table. As we do not really need a unique index for the computation runs, the question arises whether we can just use a plain index and still verify check condition 2 with a better performance than spending 10 extra seconds. The answer is: yes, we can. See below.

Another remark for reasons of completeness:

An [i,n,m]-index instead of a [n,m,i]-index would double the time required for our computation runs. The branching structure of an index tree is performance relevant! In our scenario this is due to the fact that aggregations (sums, statistical information, …) over the i-dependent quantities of a node play an important role in the calculation steps. Selects or aggregations for distinct [n,m]-nodes and aggregation processes over [n,m]-related record groups require a [n,m,…]-index. In case of doubt about the n,m-order I would say it is better to start with the more numerous branching: any initial branch chosen then leads to a smaller amount of remaining records for further analysis. Almost all statements discussed below profit from the chosen column and branching order of the index.
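The effect of the index column order can be illustrated with SQLite’s EXPLAIN QUERY PLAN as a stand-in for MySQL’s EXPLAIN (the index names are hypothetical, the miniature table follows the article’s layout): a per-node aggregation with a WHERE clause on [n,m] can seek into an index that branches on n first, but not into one that branches on i first.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TA (m INT, n INT, i INT, Q1 REAL)")
con.execute("CREATE INDEX idx_nmi ON TA (n, m, i)")   # branches on n first
con.execute("CREATE INDEX idx_inm ON TA (i, n, m)")   # branches on i first

# a per-node aggregation, as used in the computation runs
query = "SELECT SUM(Q1) FROM TA WHERE n = ? AND m = ?"
plan = con.execute("EXPLAIN QUERY PLAN " + query, (5, 1)).fetchall()
detail = plan[0][3]
print(detail)   # e.g. 'SEARCH TA USING INDEX idx_nmi (n=? AND m=?)'
```

The optimizer picks idx_nmi and performs a SEARCH (an index seek); with only idx_inm available it would have to scan, which is the doubling of computation time observed above.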

What information is really required?

Looking at condition 2 we would say: Let us get the number of all distinct [n,m,i] triples in the table and compare that with the COUNT(*) value for the table. With our given [n,m,i]-index, we could use the following SQL statement:

“SELECT COUNT(DISTINCT n,m,i) FROM TA” :: 1.9 to 2.0 secs

If we had a unique index the answer would come much faster. However, with our plain [n,m,i]-index such a statement takes 2 secs – which is still considerably less than the 10 extra seconds a unique index would have cost during the import. Our [n,m,i]-index is already used for grouping and sorting, as EXPLAIN will tell you. So, most of the time is spent comparing i-values of the sorted groups. If there were no other conditions we would have to use this statement. However, conditions 3 to 5 are very strict. Actually, some thinking shows that if we guarantee conditions 3 to 5, we have also guaranteed that the total number of records is identical to the number of distinct records: if for some given [m,n]-combination we have y distinct records, then we have y different i-values. The rest follows. Therefore, we should instead concentrate on getting information about the following numbers:

  1. The total number of distinct i values in the table. This would be used as a reference value “$ref_dist_i”.
  2. The total number of i-values ( = number of records ) “num_i” per [m,n]-node (for all nodes).
  3. The total number of distinct i-values “num_dist_i” per [m,n]-node (for all nodes).
  4. The number of records, for which num_dist_i or num_i deviate from $ref_dist_i.

If we find no deviation then we have also confirmed condition 2.
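The equivalence behind condition 2 – total record count equal to the number of distinct [n,m,i] triples – can be sketched as follows. SQLite serves as a stand-in here; it lacks MySQL’s multi-column COUNT(DISTINCT n,m,i), so a DISTINCT subquery expresses the same count:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TA (nr INTEGER PRIMARY KEY, m INT, n INT, i INT)")
con.executemany("INSERT INTO TA (m, n, i) VALUES (?, ?, ?)",
                [(1, n, i) for n in range(4) for i in range(3)])

total = con.execute("SELECT COUNT(*) FROM TA").fetchone()[0]
# stand-in for MySQL's: SELECT COUNT(DISTINCT n,m,i) FROM TA
triples = con.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT n, m, i FROM TA)").fetchone()[0]
print(total == triples)   # True when condition 2 holds
```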

“Alternative”
There is another way to prove condition 2 if there is an additional precondition:

The sequence of i-values given for each node shall be the same for all nodes.

This condition is given in very many circumstances (e.g. in physics simulations) – also in our case. Then we could look at the difference of

Max(DISTINCT i) – Min(DISTINCT i) for each node

and demand

  • [Max(DISTINCT i) – Min(DISTINCT i) + 1] == [COUNT(DISTINCT i)] == [COUNT(i)] per node
  • plus that [COUNT(i)] has the same value for all nodes.

Please note that we would not need to determine the number of distinct i values in our big table at all for this variant of our consistency check.

We shall see below why this can give us a substantial advantage in terms of SELECT times.
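The alternative check can be sketched like this (SQLite stand-in; Min/Max of distinct values equal plain MIN/MAX, so those are used; the i-range 100..102 is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TA (m INT, n INT, i INT)")
# every node carries the same i-sequence 100..102
con.executemany("INSERT INTO TA VALUES (?, ?, ?)",
                [(1, n, i) for n in range(4) for i in (100, 101, 102)])

rows = con.execute("""
    SELECT MAX(i) - MIN(i) + 1 AS span,
           COUNT(DISTINCT i)  AS n_dist,
           COUNT(i)           AS n_all
    FROM TA GROUP BY n, m""").fetchall()
per_node_ok = all(span == n_dist == n_all for span, n_dist, n_all in rows)
same_count  = len({n_all for _, _, n_all in rows}) == 1
print(per_node_ok and same_count)   # True -> checks pass without $ref_dist_i
```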

To investigate condition 1 we need additionally:

  • The total number of distinct [n,m]-values, distinct m-values and distinct n-values for the tables TA and TB.

Aggregated distinct key values for groups corresponding to leading branches of an index

Let us now consider SQL statements that determine distinct [n,m]-values, distinct m-values and distinct n-values of table TA. We make the following trivial assumption: COUNT(DISTINCT) aggregations over one or several columns are much faster if you can use an index defined for these columns. Why? Because an index has a tree structure and “knows” about the number of its branches, which are determined by unique values! Furthermore, on subsequent branching levels it can also be used for grouping and sorting of records ahead of internal value comparisons.

If this is true we would further assume that getting the number of distinct [n,m] values should be pretty fast, because our [n,m,i]- index can be used. And really :

“SELECT COUNT(DISTINCT n,m) FROM TA” :: 0.051 to 0.067 secs

The same holds for “n” alone:

“SELECT COUNT(DISTINCT n) FROM TA” :: 0.049 to 0.067 secs

The marginal difference is not surprising as there is only one [m]-value. EXPLAIN shows that in both statements above the index is used for internal grouping and sorting.

However :

“SELECT COUNT(DISTINCT m) FROM TA” :: 1.4 to 1.6 secs

What we see here is that the index may still be used for grouping – but for each “n”-branch the “m”-index values still have to be compared. Can we make this evaluation a bit faster? What about a sub-select that creates an [n,m]-basis for further analysis? This could indeed lead to better performance, as ordered [n,m]-pairs correspond directly to leading branches of our index and because the result set of the subquery will be significantly smaller than the original table:

“SELECT COUNT(DISTINCT a.ma) FROM (SELECT DISTINCT n, m AS ma FROM TA ORDER BY n) as a” :: 0.056 to 0.078 secs

Compared to a more realistic (n,m)-value distribution this result is a bit too positive, as we only have one distinct m-value. Nevertheless, we see that we can use our already created [(n,m),i]-index with considerable advantage to work on parts of condition 1 in the list above. Note that a temporary table for the derived result set of the sub-select is created; EXPLAIN and time measurements of the MySQL profiler will show that. This intermediate table is substantially smaller than the original table, and the number of distinct [m]-values is small, too (here 1).

What about table TB and these numbers? Well, TB is just as small as the intermediate table of the last statement! And because it is (relatively) small, we can afford to create whatever index is required there – one for [n], one for [m], one for [n,m]. So, we can get all required node information there with supreme speed.

Summary: We can perform consistency checks for condition 1 without any performance problems. Our [n,m]-index is suitable and sufficient for it.

Two additional lessons were learned:
1) Aggregation for key groups and leading index branches over corresponding key columns should fit to each other.
2) Intermediate temporary tables may help for a required aggregation – if they are much smaller than the original table. This leads to SQL statements with subselects/subqueries of the form

SELECT COUNT(DISTINCT a.x) FROM ( SELECT DISTINCT col1 AS x, col2, … FROM … ORDER BY … ) AS a
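This subquery pattern can be demonstrated on a miniature table (SQLite stand-in; two m-values and five n-values are made up so that the derived set really is smaller than the base table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TA (m INT, n INT, i INT)")
con.execute("CREATE INDEX idx_nmi ON TA (n, m, i)")
con.executemany("INSERT INTO TA VALUES (?, ?, ?)",
                [(m, n, i) for m in (1, 2) for n in range(5) for i in range(3)])

# shrink the 30-record table to its 10 leading [n,m] index branches first,
# then count the distinct m-values on the small derived set
dist_m = con.execute("""
    SELECT COUNT(DISTINCT a.ma)
    FROM (SELECT DISTINCT n, m AS ma FROM TA ORDER BY n) AS a""").fetchone()[0]
print(dist_m)   # 2
```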

Aggregating distinct values of a column which is the last in the column definition order of a combined index

Now, let us turn to the determination of the total number of distinct i-values. We need this value as a reference ($ref_dist_i) if we want to verify conditions 3 to 5! The fastest way to get this number – if we only have our plain, non-unique [n,m,i]-index – is a simple

SELECT COUNT(DISTINCT i) FROM TA :: 1.85 to 1.95 secs

An EXPLAIN shows that this statement already uses the [n,m,i]-index. I did not find any way to make the process faster with just our plain (non-unique) [n,m,i]-index. And note: no intermediate table would help us here – an intermediate table would only be a reordered copy of our original table. Therefore:

“SELECT COUNT(DISTINCT a.ia) FROM (SELECT DISTINCT i AS ia, n, m FROM TA ORDER BY n, m ) as a” :: 11.9 secs

To make the determination of COUNT(DISTINCT i) faster we could be tempted to create another index which starts its tree branching with the i-column, or just a plain index for the i-column. We would build such an index during the data load process. In our scenario the extra time to create such an index would be ca. 2 secs. So we would first have to invest time during the import – even if a subsequent SELECT COUNT(DISTINCT i) FROM TA would then be pretty fast.

So, if we do not get the number of distinct i-values from elsewhere (e.g. from the provider of our csv-files) we are stuck with an additional overhead of around 2 secs – which gives us a choice: create an additional index OR use the simple SQL statement?

Note that even the creation of a pure i-index would consume considerable memory space on the hard disk (in my concrete example the required index space grew from 138 MB to 186 MB – the total not being far from the total amount of data in table TA). Because I have not yet found another use for an i-index, I refrain from creating it. Instead, I save the determined number of distinct i-values in a tiny additional parameter table to make it accessible to different PHP programs afterwards.

Comparing aggregated COUNT(DISTINCT) values for record groups with a given reference value

Conditions 3 to 5 of our task list require that we compare the reference value $ref_dist_i for distinct i-values of table TA with both the number of distinct i-values and the total number of i-values for each of the [m,n]-nodes. This leads us to something like the following SQL statement with a subselect/subquery:

Comparison Statement:

SELECT count(*) FROM ( SELECT COUNT(distinct b.i) AS sum_distinct_i, COUNT(b.i) AS sum_i FROM TA b GROUP BY b.n, b.m HAVING sum_distinct_i != $ref_dist_i OR sum_i != $ref_dist_i ) AS a

It gives us the number of nodes for which either the number of associated distinct i-values or the total number of associated i-values deviates from the reference value (here $ref_dist_i). This complicated statement requires some time.

Comparison statement :: < 2.2 secs

This is despite the fact that our [n,m,i]-index is used, as an EXPLAIN will show. You will not get much faster than that. The alternative

SELECT count(*) FROM ( SELECT COUNT(distinct b.i) AS sum_distinct_i, COUNT(b.i) AS sum_i FROM TA b GROUP BY b.n, b.m ) AS a WHERE a.sum_distinct_i != $ref_dist_i OR a.sum_i != $ref_dist_i

has the same performance. The slow part is the WHERE or HAVING analysis on the derived temporary result set.
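The behavior of the comparison statement can be sketched on a miniature table (SQLite stand-in; one deliberately duplicated record makes a single node inconsistent so the statement has something to find):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TA (m INT, n INT, i INT)")
rows = [(1, n, i) for n in range(4) for i in range(3)]
rows.append((1, 0, 0))           # duplicate record -> node (m=1, n=0) inconsistent
con.executemany("INSERT INTO TA VALUES (?, ?, ?)", rows)

ref_dist_i = 3                   # the reference value $ref_dist_i
# SQLite wants the full expressions in HAVING; MySQL also accepts the aliases
deviating = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT COUNT(DISTINCT b.i) AS sum_distinct_i, COUNT(b.i) AS sum_i
        FROM TA b GROUP BY b.n, b.m
        HAVING COUNT(DISTINCT b.i) != ? OR COUNT(b.i) != ?) AS a""",
    (ref_dist_i, ref_dist_i)).fetchone()[0]
print(deviating)   # 1 -> exactly the node with the duplicated record
```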

By the way:
Working with an [n,m]-index and a separate [i]-index would slow the comparison statement down to > 2.6 secs. So, here we find after all a slight advantage of our combined index over three columns (the computation runs also get a bit faster). In addition, the combined index takes less disk space than several individual indices.

We now should take into account the following aspect of our scenario: We may call such a comparison statement several times if we load a sequence of data files – each with 5 million data records. Then our “comparison statement” will get slower and slower the more records the table comprises. In addition we may also start every computational run with a consistency check of the tables and show the results in a browser! Then we need to optimize the response time of the web page. Even 2.2 secs may then be too much. How can we meet this challenge?

Helper table instead of derived result sets from subselects/subqueries

My answer was to use a helper table “TC” – which receives the data of the derived result set from the sub-select of our comparison statement. In table TC we would store something like

m,    n,    sum_distinct_i,    sum_i

and maybe other node-dependent values. This also has the advantage of flexibility in case that some day the distinct i-values become [n,m]-dependent and have to be compared to some other values.

Note that the number of rows in such a table is considerably smaller than in the original table. So, speaking in relative numbers, we could generate indices for [m,n], m, n, sum_distinct_i and sum_i without much investment of CPU time and memory! This would also relieve us of the complicated statements needed to determine the number of distinct m-values discussed above. And statements such as

SELECT COUNT(*) FROM TC WHERE sum_distinct_i != $ref_dist_i OR sum_i != $ref_dist_i

would perform very well. (Under our present conditions this should give a plain zero if consistency is given).

Therefore, only one question remains: What would such a helper table cost in terms of time? The proper statement to fill a prepared table TC with indices is

INSERT INTO TC (m, n, sum_distinct_i, sum_i) SELECT a.m, a.n, COUNT(DISTINCT a.i), COUNT(a.i) FROM TA a GROUP BY a.n, a.m

This takes approximately 2.26 secs. However, an additional

SELECT COUNT(*) as sum_dist FROM TC WHERE sum_distinct_i != $ref_dist_i OR sum_i != $ref_dist_i

takes only 0.002 secs. Further statements on TC are equally fast:

  • SELECT COUNT(DISTINCT m) FROM TC :: < 0.002 secs
  • SELECT COUNT(DISTINCT n) FROM TC :: 0.018 to 0.026 secs
  • SELECT COUNT(DISTINCT n,m) FROM TC :: 0.007 to 0.012 secs

This means that we do no longer have to be afraid of repeated consistency checks!
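The helper-table technique – pay the aggregation price once, then run every later check against the tiny TC – can be sketched as follows (SQLite stand-in with miniature data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TA (m INT, n INT, i INT)")
con.executemany("INSERT INTO TA VALUES (?, ?, ?)",
                [(1, n, i) for n in range(4) for i in range(3)])

con.execute("CREATE TABLE TC (m INT, n INT, sum_distinct_i INT, sum_i INT)")
con.execute("CREATE INDEX idx_tc ON TC (sum_distinct_i, sum_i)")

# pay the aggregation price once ...
con.execute("""INSERT INTO TC (m, n, sum_distinct_i, sum_i)
               SELECT a.m, a.n, COUNT(DISTINCT a.i), COUNT(a.i)
               FROM TA a GROUP BY a.n, a.m""")

ref_dist_i = 3
# ... and every repeated consistency check is a cheap scan of the tiny TC
bad_nodes = con.execute("""SELECT COUNT(*) FROM TC
                           WHERE sum_distinct_i != ? OR sum_i != ?""",
                        (ref_dist_i, ref_dist_i)).fetchone()[0]
print(bad_nodes)   # 0 -> all nodes consistent
```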

“Alternative”
If our conditions are such that we can follow the alternative discussed above the helper table is of even more help. The reason is that we can omit the determination of our reference value $ref_dist_i. Instead we fill the (modified) helper table with

INSERT INTO TC (m, n, sum_distinct_i, sum_i, min_i, max_i) SELECT a.m, a.n, COUNT(DISTINCT a.i), COUNT(a.i), MIN(DISTINCT i), MAX(DISTINCT i) FROM TA a GROUP BY a.n, a.m

This costs us only slightly more time than the original INSERT. With a proper index generated over the columns sum_distinct_i, sum_i, min_i, max_i we then use

SELECT COUNT(DISTINCT sum_distinct_i, sum_i, min_i, max_i) FROM TC :: < 0.002 sec

In case everything is OK this should give us a “1” in almost no time. Then we additionally fetch just one line

SELECT * FROM TC LIMIT 1

and check the conditions [Max(DISTINCT i) – Min(DISTINCT i) + 1] == [COUNT(DISTINCT i)] == [COUNT(i)] for this node only.
The big advantage is that we win the time otherwise spent determining the COUNT(DISTINCT i) value for our big table. This advantage gets bigger with a growing table.

Remark regarding indices: As you may have noticed, we clutter the helper table with indices over various columns. Normally I would consider this a bad idea. But in our case I do not care – the numerous indices consume relatively little space for our (relatively) small helper table, a small price to pay compared to other things.

The resulting sum of additional time investments for preparing the consistency checks

We find for our 5 million record table that we need to invest some extra time of

  • around 2.2 secs for creating a helper table

in the best case. If we cannot use the “Alternative” we additionally need around 2 secs for getting the number of distinct i-values in our table.

So, this makes around 8% in the best case, up to 16%, of the original loading time (26.3 secs). I think this is a good compromise as it gives us a repeatable consistency check for a given status of the tables.

Extrapolation to table enlargements resulting from loading a sequence of further data files

What happens if we deal with larger data amounts sequentially loaded into our big data table from several csv-files? We expect things to behave reasonably well:
A critical point is that the helper table TC should only get new records originating from the newly loaded file. Already saved aggregated values should not be changed. This in turn requires that the data in the new files belong to nodes distinct from the already loaded nodes. So, the imported data must be organized by nodes and systematically spread over the files to be loaded. The relevant data records for the new aggregations can easily be identified by evaluating an auto-increment column of the big data table before and after the loading process. The aggregation results for the new nodes are then added to the helper table without any interference with the already aggregated values stored there.
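The watermark idea described above can be sketched like this (SQLite stand-in; the helper function import_file and the two miniature “csv-files” are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE TA (nr INTEGER PRIMARY KEY AUTOINCREMENT,
                                m INT, n INT, i INT)""")
con.execute("CREATE TABLE TC (m INT, n INT, sum_distinct_i INT, sum_i INT)")

def import_file(records):
    # remember the auto-increment high-water mark before the "LOAD DATA"
    last_nr = con.execute("SELECT COALESCE(MAX(nr), 0) FROM TA").fetchone()[0]
    con.executemany("INSERT INTO TA (m, n, i) VALUES (?, ?, ?)", records)
    # aggregate only the freshly imported records into the helper table
    con.execute("""INSERT INTO TC (m, n, sum_distinct_i, sum_i)
                   SELECT a.m, a.n, COUNT(DISTINCT a.i), COUNT(a.i)
                   FROM TA a WHERE a.nr > ? GROUP BY a.n, a.m""", (last_nr,))

import_file([(1, n, i) for n in range(2) for i in range(3)])      # csv-file 1
import_file([(1, n, i) for n in range(2, 4) for i in range(3)])   # csv-file 2
print(con.execute("SELECT COUNT(*) FROM TC").fetchone()[0])   # 4 -> one row per node
```

Because each file brings its own nodes, the second aggregation never touches the rows written by the first, and the cost per file stays constant.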

Therefore, the time to generate additional entries in our helper table should remain a constant per csv-file to be imported!

However – if we need to do it – the determination of the COUNT(DISTINCT i) would require more and more time for our growing big table.

So, at this point we will really profit significantly from the discussed “Alternative”.

The time to perform our checks on the helper table will also grow – but in comparison to other times this remains a really small contribution.

Summary

In our scenario we had to find a compromise between index generation and a resulting prolongation of data loading and preparation time versus a faster consistency check (partially) based on aggregated COUNT(DISTINCT) values for (grouped) key columns. Such checks typically lead to sub-selects with derived result sets. Even if the temporary tables implicitly used by the RDBMS engine are relatively small compared to the original big tables any further aggregation or analysis on the derived data may cost considerable time. If such analysis is to be done more often than once the creation of a helper table with aggregated data may pay off very quickly.

In addition we have seen that one should carefully evaluate alternatives to verify consistency conditions and what information can and should be used in the verification process. Not all obvious ways are the best in terms of performance.

Addendum, 21.09.2014:
Is this the end of a reasonable line of thinking ? No, it is not, if you want to make your big data imports faster.
Creating a helper table is useful. But there are more points to take into account when trying to optimize:

The reader has probably noticed that the creation of a unique [n,m,i]-index enlarged the loading time drastically. This observation puts a huge question mark behind the fact that I had a primary index on the first auto-incremented “nr”-column. Furthermore, what about the impact of index building in parallel to the processing of “LOAD DATA INFILE” in general? We did not discuss whether we could accelerate the import by omitting index building throughout the “LOAD DATA” import completely and creating the index afterwards instead! And we assumed that index creation during the import of a sequence of files would behave reasonably – i.e. linearly with the growing number of records. Actually, as further tests show, the latter is not always true, and separating index creation from the execution of “LOAD DATA INFILE” is an important optimization step! See a forthcoming article for a more detailed discussion of these interesting topics!

PHP/MySQL/Linux: File upload, database import and access right problems

On an (Apache) web server one should establish a policy regarding access rights to the files of hosted (virtual) domains – especially with respect to files transferred by FTP or uploaded from a browser. It was an interesting experience for me that uploading files during an Ajax communication with a PHP program and moving them to target directories with PHP’s “move_uploaded_file” can collide with such policies and lead to unexpected results.

The problem

I recently tested Ajax controlled csv file transfers from a browser to a web server with a subsequent loading of the file contents to a database via PHP/MySQL. The database import was initiated by PHP by executing the SQL command “LOAD DATA INFILE” on the MySQL server. This chain of processes worked very well:

The uploaded csv file is moved from PHP’s $_FILES superglobal array (which works as an input buffer for uploaded files) to a target directory on the web server by means of the PHP function “move_uploaded_file”. My PHP program – the Ajax counterpart on the server – afterwards triggers a special MySQL loader procedure via the “LOAD DATA INFILE” command. MySQL then loads the data with very high speed into a specified database table. It is clear that the import requires sufficient database and table access rights, which have to be specified by the PHP program when opening the database connection via one of PHP’s MySQL interfaces.

The overall success of the upload and database import sequence changed, however, in a surprising way when I wanted to transfer my (rather big) files in a zip-compressed form from the browser to the web server.

So I compressed my original csv file into a zip container file. This zip file was transferred to the server by using the same web site form and Ajax controls as before. On the server my PHP program went through the following steps to make the contents of the zip container available for further processing (such as the import into the database):

  • Step 1: I used “unlink” to delete any existing files in the target directory.
  • Step 2: I used “move_uploaded_file” to save the zip file into the usual target directory.
  • Step 3: I used the “ZipArchive” class and its methods from PHP to uncompress the zip-file content within the target directory.

Unfortunately, the “LOAD DATA INFILE” command failed under these conditions.

However, everything still worked well when I transferred uncompressed files. And even more astonishing:
When I first uploaded the uncompressed version and then tried the same with the zip-version BUT omitted Step 1 above (i.e. did not delete the existing file in the target directory), “LOAD DATA” also worked perfectly.

It took me some time to find out what happened. The basic reason for this strange behavior was the peculiar way in which file ownership and access rights are handled by “move_uploaded_file” and by the method ZipArchive::extractTo. And in a convoluted way it was also due to some naivety on my side regarding my established access right policy on the server.

Note in addition: The failures took place although I used SGID and setfacl policies on the target directory (see below).

User right settings on my target directory

On some of my web servers – as the one used for the uploads – I restrict access rights to certain directories below the root directories of some virtual web domains to special user groups. One reason for this is the access control of different developer groups and FTP users. In addition I enforce automatic right and group settings for newly created files in these directories by ACL (setfacl) and SGID settings.

The Apache process owner becomes a member of the special group(s) owning these special directories – which is natural, as PHP has to work with the files.

Such an access policy was also established for the target directory of my file uploads. Let us say one of these special groups would be “devops” and our target directory would be “uploads”. Normally, when a user of “devops” (e.g. the Apache/PHP process)

  • creates a file in the directory “uploads”
  • or copies a file to the directory “uploads”

the file would get the group “devops” and “-rw-rw----” (0660) rights – the latter due to my ACL settings.

What does “move_uploaded_file” do regarding file ownership and access rights ?

The first thing worth noting is that what this PHP function does regarding ownership settings and access rights has changed at some point in the past. It furthermore seems to depend on whether the target directory resides on a different (mounted) file system. See the links given at the bottom of this article for more information.

Some years ago the right settings for the moved file in the target directory were “0600”. This obviously has changed :

  • The right settings of moved files today are “0644” (independent of file system questions).

However, what about the owner and the group of the moved file? Here “move_uploaded_file” seems to have its very own policy:

  • It sets the file owner to the owner of the Apache web server process (on my Opensuse Linux to “wwwrun“).
  • It always sets the group to “www” – and it does so independently of

    • whether the SGID bit is set on the target directory or not (!),
    • whether the Apache web server process owner really is a member of “www” (!),
    • whether a file with the same name already exists in the target directory – maybe with a different group (!).

Funny, isn’t it? Test it out! I find the last three facts really surprising; they do not reflect a standard copy policy. As a result, in my case the uploaded file always gets the following settings when saved to the file system by “move_uploaded_file”:

owner “wwwrun” and group “www” and the access rights “0644”.

So, after the transfer to the server, my target file ends up with being world readable!

What does ZipArchive do with respect to ownership and access rights ?

As far as I have tested “ZipArchive::extractTo”, I dare say the following: it respects my SGID and ACL settings. It behaves more or less as the Linux “cp” command would.

So, when ZipArchive has done its extraction job the target file will have very different settings:

owner “wwwrun”, BUT group “devops” and the access rights “0660”.

However, this is only the case if the target file did not yet exist before the extraction!
If a file with the same name already existed in the target directory, “ZipArchive::extractTo” respects the current owner/group and access right settings (just as the cp command would have done!). Test it out!

The impact on MySQL’s “LOAD DATA INFILE …”

The MySQL process has its own owner – on my system “mysql”. When PHP issues the SQL command “LOAD DATA INFILE” via one of its MySQL interfaces, the MySQL engine uses an internal procedure to access the (csv) file and loads its content into a database table. You may rightfully conclude that the opening of the file will be done by a process with “mysql” as the owner.

So, it becomes clear that not only the rights of the Apache/PHP process or database access rights are important in the scenario described above:

The MySQL (sub-) process itself must have access rights to the imported file!
 
But as we have learned: this is not automatically the case once the ZipArchive extraction has finished – if “mysql” is not, by accident or purpose, a member of the group “devops”.

Now, take all the information given above together – and one understands the strange behavior:

  • When I uploaded the uncompressed file it got rights such that it was world readable and its contents could therefore be accessed by “mysql”. That guaranteed that “LOAD DATA INFILE” worked in the first standard scenario without a zip file.
  • If the target file is not deleted and exists before a zip extraction process, it will be overwritten without a change of its properties. That makes “LOAD DATA INFILE” work also after a zip extraction – as long as we do not delete a previously existing, world readable target file from a standard upload with the same name.
  • However, in case of an emptied target directory the extraction process respects the SGID and ACL settings – and then “mysql” has no read access right for the file!
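The permission logic behind these cases can be emulated with a few stat bits. The following hypothetical helper (Python purely for illustration – the article's stack is PHP) checks whether a given user, e.g. the one running the “mysql” process, could read a file; it is a simplification that ignores ACLs and path permissions:

```python
import os
import stat

def can_user_read(path: str, uid: int, gids: set) -> bool:
    """Rough emulation of the Unix read-permission check:
    owner class first, then group class, then others."""
    st = os.stat(path)
    if st.st_uid == uid:                       # owner class applies
        return bool(st.st_mode & stat.S_IRUSR)
    if st.st_gid in gids:                      # group class applies
        return bool(st.st_mode & stat.S_IRGRP)
    return bool(st.st_mode & stat.S_IROTH)     # everybody else
```

With mode 0644 any uid passes via the “others” bit – the accidental world readability after “move_uploaded_file”. With mode 0660 a uid outside the owning group fails – exactly the situation of “mysql” after the ACL-respecting zip extraction.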

A conclusion is that I had stumbled into a problem which I had partially created by myself:

I had established a policy for group and right settings which was respected when extracting zipped files. This policy collided with the file access rights required by the MySQL processes. Stupid me! I should have taken that into account!

Accidentally, my policy was not respected when uploading and handling standard files directly, without encapsulating them in a zip container. Due to the failure of my policy and the disregard of special right settings by “move_uploaded_file”, the subsequent MySQL process could use “LOAD DATA INFILE”. Not what I wanted or had expected – but the policy violation almost went undetected due to the overall success.

Conclusions

I hope that the points discussed in this article have made it clear that the access rights of uploaded and/or zip-extracted files are something to take good care of on a Linux web server:

Do not rely on any expectations regarding how PHP functions and objects handle user/group ownerships and access rights! In case of file uploads (with and without zip containers) test it out, check the behavior of your programs and set the rights explicitly – with the available PHP file handling functions – to what you want to achieve.

The consequences of the peculiar right settings of “move_uploaded_file” and its ignorance of SGID/ACL policies must be taken into account. Other PHP routines may handle access rights of copied/moved/extracted files differently than “move_uploaded_file”. Furthermore, if you want to load data from uploaded (csv) files into the database, take into account that the “mysql” user needs read access to your file.

Solutions for my scenario with zip files

Let us assume that I follow my own advice and set the rights for files saved by “move_uploaded_file” explicitly, namely such that my policy with the group “devops” and the “0660” rights is respected. Then – without further measures – no uploaded file could be handled by the MySQL “LOAD DATA INFILE” mechanism. In my special case one could think about several possible solutions. I just name three of them:

Solution 1: If – for some reason – you do not even temporarily want to break your SGID/ACL policies for the uploaded and extracted file in your target directory, you could make “mysql” a member of your special group (here “devops”). Note that this may also have some security implications you should consider carefully.

Solution 2: You may set up another target directory with special access rights granted only to “mysql” and “wwwrun” by defining a proper group. This directory would only be used to manage the data load into the database. So, copy or move your uploaded/extracted file there, set proper rights if necessary and then start the “LOAD DATA” process for this target file. And clean the special directory afterwards – e.g. by moving your upload file elsewhere (for example to an archive).

Solution 3: You may change the access rights and/or the group of the file temporarily by the PHP program before initiating the MySQL “LOAD DATA” procedure. You may reset rights after the successful database import or you may even delete or move the file to another directory (by rename) – and adjust its rights there to whatever you want.
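Solution 3 can be sketched as follows – Python for illustration; in the PHP program one would call chmod()/chgrp() analogously, and the modes are of course placeholders for whatever your policy demands:

```python
import os

def with_relaxed_rights(path: str, load_callback) -> None:
    """Temporarily widen the file mode so the database process
    can read the file, then restore the original mode."""
    original_mode = os.stat(path).st_mode & 0o777
    os.chmod(path, 0o644)              # readable for the 'mysql' process
    try:
        load_callback(path)            # e.g. issue LOAD DATA INFILE here
    finally:
        os.chmod(path, original_mode)  # restore the policy-conforming mode
```

The try/finally (in PHP: an equivalent cleanup path) ensures that the policy-conforming rights are restored even if the import fails.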

Whatever you do – continue having fun with Ajax and PHP controlled file uploads. I shall come back to this interesting subject in further articles.

Links

New access right settings of move_uploaded_file
https://blog.tigertech.net/posts/php-upload-permissions/
Old access right settings by move_uploaded_file and ignorance of SGID sticky bit
http://de1.php.net/manual/en/function.move-uploaded-file.php#85149
cp and access rights (especially when a file already exists in the target directory)
http://superuser.com/questions/409226/linux-permissions-and-owner-being-preserved-with-cp
The use of SGID
http://www.library.yale.edu/wsg/docs/permissions/sgid.htm

 

MySQL/PHP: LOAD DATA – import of large csv files – linearity with record number?

In my last article
Importing large csv files with PHP into a MySQL MyISAM table
I talked a little about my first positive experience with using the “LOAD DATA INFILE …” SQL statement in PHP programs to load csv files with millions of records. A reader asked me whether the import time behaves linearly with the number of records. Actually, I am not completely sure yet; the reader has to wait for some more experiments on my side. Right now we are preparing tests for 1, 2, 5 and 10 million line csv files for a certain project I am involved in.

So far, we have passed tests with sequences of 1, 2 and 5 million line (= record) files. For those our results look pretty linear:

  • For 1 million records we needed ca. 7.5 seconds to import them into a MySQL MyISAM table (with index building).
  • For 2 million records we needed ca. 15 seconds.
  • For 5 million records we needed ca. 38 seconds.

The csv data files were originally 32 MB, 64 MB and 165 MB in size. We transferred them as compressed zip files to the server (size reduction by a factor of 4 in our case).
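The factor of 4 naturally depends on the data; repetitive numeric csv content compresses well. If you want to check the ratio for your own files, a small sketch (Python purely for illustration):

```python
import os
import zipfile

def zip_ratio(csv_path: str, zip_path: str) -> float:
    """Compress a file with deflate and return original/compressed size."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(csv_path, arcname=os.path.basename(csv_path))
    return os.path.getsize(csv_path) / os.path.getsize(zip_path)
```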

Addendum, 19.09.2014, regarding the absolute numbers for the import time:

I should have mentioned that the absolute numbers for the loading times are a bit large. The reason is that some indices are defined on the table and the index trees are built up during the loading process. Actually, one can load our 5 million records in under 6 seconds if no indices are defined on the table. Furthermore, one should separate index building completely from the loading process: building the index in one step after all data have been loaded may take significantly less time than building it during the loading process – especially if non-unique indices are involved. See a forthcoming article
MySQL: LOAD DATA INFILE, csv-files and index creation for big data tables
about this interesting topic.

Furthermore, the measured time intervals included the following main actions:

  • jQuery/PHP: transfer of a zip file from a control web page to the server via Ajax in a local network
  • PHP: saving of the transferred file in a special directory on the server
  • PHP: unzipping the file there
  • PHP/MySQL: selecting information from control tables (control parameter)
  • PHP: checking the present status of several target tables by various selects (selects for certain consistency numbers, not only from the filled table)
  • PHP/jQuery: return of an Ajax message to the browser confirming the successful file transfer to the server
  • jQuery/PHP: automatic start of a new Ajax transaction from the client with control parameters for the import and subsequent checks
  • PHP/MySQL: truncation of existing import table and dropping of views
  • PHP/MySQL: importing the data by a “LOAD DATA INFILE … ” SQL statement issued by the PHP program
  • PHP/MySQL: creating (3) views
  • PHP/MySQL: checking present status of several target tables by various selects (consistency numbers)
  • PHP/jQuery: Return of an Ajax message to the client – confirming the successful import and delivering consistency data

So, actually it is a bit difficult to extract the real time required for the file import by “LOAD DATA” from this bunch of very different activities. However, the whole sequence should behave approximately linearly if everything works correctly. (If you wonder why we have this twofold Ajax transaction structure – wait for more articles in this blog.)

The rather linear behavior for our 1 to 5 million record files is also confirmed by the following graphs

Graph for 1 million records

file_upload_1mio_mysql_cut

Graph for 2 million records

file_upload_2mio_mysql_cut

Graph for 5 million records

file_upload_5mio_mysql_cut

You can see the twofold SELECT structure resulting from the action sequence described above.

Please note the systematic rise of intermediately used memory from around 700 MByte up to 1.3 GByte! I conclude that the linear behavior is valid only as long as the system has enough free RAM to deal with the data.

I shall supplement more measurement data when we deal with larger files. Our hope is that we can confirm a more or less linear behavior also for larger import files with more records (provided that we supply our virtualized server system with sufficient RAM). However, you can never be sure – when the files get really huge some other memory limitations somewhere on the server (by system or service configuration parameters) may reduce the loading performance drastically. We shall see …

Importing large csv files with PHP into a MySQL MyISAM table

In one of my present projects there is a requirement to upload tens of millions of records from several csv files into a database. The files are uploaded to the server with the help of Ajax controlled browser/server-transactions. On the server the file contents should be checked and then transferred into an appropriate database table. In my case a MyISAM table.

This is the first time that I had to deal with such big record numbers on a MySQL database. On an Oracle database I worked with tables of around 30 million records some years ago – but those data came from other sources and were supplied more continuously, not in form of a one-step csv file import.

Although the data and table structures for the import are flat, the pure amount of data makes it necessary to think a little about the way of transferring it from file lines to records of a MySQL database table with the help of a PHP server program. Below, I really do not present fundamentally new insights – a MySQL expert will not find anything new. But maybe others who stumble into a big data import problem for the first time may find some of the hints below useful.

A simple, but time consuming standard approach for a csv import with PHP

In the past when I had to import data from files I often used a sequence of steps comprising

  • the opening of the file for reading,
  • the use of fgetcsv() to read in one line,
  • the use of my own, problem-specific methods to check the identified fields,
  • the execution of one INSERT SQL statement per file line.
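The steps above can be sketched as follows – Python for illustration (fgetcsv corresponds to the csv module; the check routine, the execute callback, and the table/column names are placeholders):

```python
import csv

def import_line_by_line(path, execute, check=lambda row: True):
    """Classic import: read one csv line at a time, check it,
    and issue one INSERT statement per line."""
    inserted = 0
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=";"):
            if not check(row):          # problem-specific field checks
                continue
            # NOTE: real code must escape/parameterize the values;
            # naive string interpolation is shown only for brevity
            values = ", ".join("'%s'" % v for v in row)
            execute("INSERT INTO t (a, b, c) VALUES (%s)" % values)
            inserted += 1
    return inserted
```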

Reasons for such an approach were: you have complete control over the data and the import process. You may check the contents of each file line thoroughly before inserting it. You may send intermediate positive or negative messages to the client via Ajax, etc.

However, after some thorough tests I found the following:

A data import based on the approach described above is really time consuming. To give you an impression: The typical import time I found for my hardware and virtualization conditions was

108 seconds for

  • a standard MySQL 5 database on a small virtualized KVM server instance with one CPU core and 1GB RAM (256 MB reserved for MySQL; host and KVM guest both run Opensuse 13.1),
  • a file with 1 million lines (each containing 6 fields – 1 field for a primary integer key / 3 integer fields together forming another unique key / 1 quantity value field )
  • one INSERT for each line with parallel index building for each of the three integer columns.

That really meant trouble for my rather impatient customer. The extrapolated time for the import of similar contents of 8 additional files and a potential factor of 50 with respect to the final number of file lines/records made him nervous. And he was also aware of the fact that due to index building the total required import time may grow faster than linear with record number.

It may be hard to say goodbye to good old habits. But when upload and import time counts more than control, one should look for other import methods. After some tests I found the following approaches, which can lead to a significant improvement of import times:

Improvement 1: Work with bunches of records per INSERT

What is the first guess for time consuming operations? It is the overhead that the preparation of an SQL statement takes. If this guess is right, one should reduce the number of INSERT statements. And really: after some trials I found that you may accelerate things significantly by avoiding one INSERT per record/file line. Working with bunches of records bundled into just one INSERT statement really helps to save time. This seems natural and understandable; see also the following article:
http://dev.mysql.com/doc/refman/5.1/en/insert-speed.html

So, one thing you could try is to put the result of several fgetcsv() calls into one INSERT statement. In a next step you would try to find an optimum number of records per INSERT by systematic testing. This may already give you a substantial acceleration factor.
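The bunching idea can be sketched as follows – Python for illustration (the article's code is PHP); the table name is a placeholder and the chunk size is exactly the parameter to tune by systematic testing:

```python
def bunched_inserts(rows, table="t", chunk_size=1000):
    """Turn a row iterable into multi-row INSERT statements,
    chunk_size rows per statement."""
    statements, bunch = [], []
    for row in rows:
        bunch.append("(%s)" % ", ".join(str(v) for v in row))
        if len(bunch) == chunk_size:
            statements.append("INSERT INTO %s VALUES %s"
                              % (table, ", ".join(bunch)))
            bunch = []
    if bunch:  # flush the last, possibly smaller bunch
        statements.append("INSERT INTO %s VALUES %s"
                          % (table, ", ".join(bunch)))
    return statements
```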

Improvement 2: Use “LOAD DATA INFILE … “

If you need an acceleration factor above 10 for loading csv data, I strongly suggest that you use a special feature of the MySQL engine and its related SQL extension in your PHP loader program. See:

http://dev.mysql.com/doc/refman/5.6/en/extensions-to-ansi.html
http://dev.mysql.com/doc/refman/5.6/en/load-data.html

The article promises an acceleration factor up to 20. I can confirm a factor of 18:

In the case of a 1 million record csv file (each line with 6 fields, 4 indices generated on 4 columns) I got the following acceleration:

  • Single INSERTS: 108 seconds
  • LOAD DATA : 6 seconds


(Both loading processes were done with some defined and active indices that were built up during the filling of the table. Therefore, forget about the absolute numbers – without any indices you may load 1 million records in a time around or below one second.)

A relative factor of 18 is impressive. However, the usage of such an optimized loading mechanism may also have disadvantages. One question is: how are lines with a wrong field number or wrong data type formats handled? One can imagine a policy by which the load process does not stop because of one defective line. Error handling would then depend on a reasonable summary of which file lines/records were defective. We come back to this point below.

Improvement 3: Separate the index generation from the data filling process

Even if you use “LOAD DATA…” there is still room for more improvement if you separate the index generation from the filling of the table. This is described in detail in the documentation of the “LOAD DATA” statement. See:
http://dev.mysql.com/doc/refman/5.1/de/load-data.html

I have not tried or investigated this yet; so, I can neither give a general impression of this recipe nor an acceleration factor. But I consider it worth trying when I find the time. Maybe after the first 10 million record import 🙂 .

Addendum, 19.09.2014:

I can now confirm that separating the index creation from the loading process with “LOAD DATA INFILE” may give you another significant factor. This is especially true when unique indices are built in parallel to the data loading. Already omitting a simple auto-incremented unique index over one column may give you a factor of around 2. See a forthcoming article for more details.

Improvement 4: Combine quantities which have the same dependencies on keys in one file and insert them into one and the same table

In my case we have to import 2 groups of data for N different quantities with the same key dependency – in our case 2 groups with 4 different quantities, each. Due to the requirements of using these data independently in different calculation steps, it seemed to be wise in a first approach to load all these quantities into different tables despite the very same key structure and key dependency (in our case on integer triples describing a 3-dimensional finite numerical grid structure).

However, for the sake of performance one should really reconsider and challenge such a separate table/file strategy and do some performance tests for a combined table/file strategy where all quantities with the same keys reside in one and the same file/table. Reasons:

  • To deliver data for one and the same key combination in one combined file is also a matter of transfer efficiency to the server as the total amount of data to be transferred via the network/Internet gets less.
  • It is not necessary to change your programming statements for calculations with data taken from separate tables if you work with appropriate views of the combined table that “simulate” such separate tables for you. In my case views of the type “MERGE” were appropriate. I could not see any significant performance reductions when using views instead of tables.
  • Regarding the csv import, the most important effect is that instead of importing N files of the same structure you import data from just one file. That reduces the number of INSERT operations by a factor of N. The question remains how that relates to the fact that each INSERT now writes more columns of the import table.

The last point is really interesting. In our test case – which may be a bit special – we got the following result when using the “LOAD DATA INFILE” statement for loading directly from a file with four quantities combined instead of just one:

For a million records the import time did not change very much in comparison to the 6 seconds required for importing a file for just one quantity – actually the difference for loading our file with 4 quantities was below 10%. That means that in our case we gained an additional factor of almost 4 for shortening the total required import time.

So, by using “LOAD DATA INFILE” AND combining data quantities with the same key dependency in just one file, I could reduce the total loading time for 4 files with a million records each by a factor of around 70.

A hint regarding error handling for LOAD DATA

In our case we still use the old “mysql_query” interface and not “mysqli”. Do not ask me for the reasons – most of it is laziness. From a professional point of view I advise against the usage of the old interface – especially as it may disappear with coming PHP versions. Nevertheless, if you use the old interface you will probably use statements like (I simplify):

Wrong approach for error control

	$sql_load = 	" LOAD DATA INFILE '" . $this->imp_file . "'" . 
			" INTO TABLE " . $this->tbl_name . 
			" FIELDS TERMINATED BY " . "';'" . 
			// " LINES TERMINATED BY " . "'\\r\\n'" . 
			" IGNORE 1 LINES"; 
	
	if ( mysql_query($sql_load, $this->DB->db) === false ) {
		// Steps to control errors 
	}
 

(You may need the commented line in case of files generated on Windows).

Do not always expect something reasonable from this error detection approach! My impression is that it works reliably only in case of SQL syntax errors, but not in case of defective lines in the csv file. Instead you may have to look explicitly at the output of mysql_error():

Better approach

	$sql_load = 	" LOAD DATA INFILE '" . $this->imp_file . "'" . 
			....

	if ( mysql_error() != '' ) {
		// Steps to control errors 
	}
 

Or you may even use “SHOW WARNINGS”. An explicit look at the existence of errors or warnings is also helpful if the whole import is part of an Ajax transaction and you want to send appropriate error messages to the client in case of errors detected during the “LOAD DATA …” process. This leads us to the next point.

What about the requested checks of the data ?

In our approach centered around “LOAD DATA INFILE ..” we have lost track of another requirement, namely that the file data to be imported should first be checked for errors. Well, in this early stage of experimenting I have two – probably not satisfactory – comments:

Many tests can be made before or after the “LOAD DATA” controlled database import. This holds e.g. for consistency checks like number comparisons. For such points you can perform pretty fast SELECT COUNT() statements on the filled database table. In our case e.g.:
For all network nodes (described by tuples of integer key numbers): are there equal numbers of records in the third key dimension (in our case time) for all nodes?
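Such a per-node comparison boils down to a GROUP BY query with a HAVING clause. A sketch – using SQLite as a stand-in for MySQL, with invented table and column names:

```python
import sqlite3

def nodes_with_deviating_count(conn, expected):
    """Return (i, j) node tuples whose number of distinct time
    records differs from the expected count."""
    cur = conn.execute(
        "SELECT i, j, COUNT(DISTINCT t) AS n FROM measurements "
        "GROUP BY i, j HAVING n != ?", (expected,))
    return cur.fetchall()
```

With MySQL the same SELECT works unchanged; an empty result set means all nodes carry the expected number of time records.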

Other checks may, however, directly concern the quality and type consistency of the input data. So, what happens if “LOAD DATA …” stumbles across a line with missing fields or a wrong type of data (compared to the database table column definitions)?
In the documentation http://dev.mysql.com/doc/refman/5.1/en/load-data.html it is described what happens e.g. if a line has insufficient fields or too many fields. Automatic type conversions are probably tried when the type of a field value does not fit.

I have not yet tested what really and exactly happens in such error cases. The official documentation is not at all clear to me regarding this point. It seems reasonable to assume that an automatic routine for the import of csv data lines would try anything to get a line into the database table. So, I would expect a standard procedure to compensate for missing or surplus fields and to try some automatic type conversions before a line is skipped. I would not really expect that a faulty line leads to a direct stop of the import process.

In addition I would expect something like a bad record log. Unfortunately, there are indications that such a log is not generated. See:
http://chrisjohnson.blogsite.org/php-and-mysql-data-import-performance/

The documentation http://dev.mysql.com/doc/refman/5.1/en/load-data.html says:

When the LOAD DATA INFILE statement finishes, it returns an information string in the following format:
 
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0

So at least this information could in principle be evaluated. In addition the documentation says:

Warnings occur under the same circumstances as when values are inserted using the INSERT statement (see Section 13.2.5, “INSERT Syntax”), except that LOAD DATA INFILE also generates warnings when there are too few or too many fields in the input row.
 
You can use SHOW WARNINGS to get a list of the first max_error_count warnings as information about what went wrong. See Section 13.7.5.42, “SHOW WARNINGS Syntax”.

“SHOW WARNINGS” actually also informs about errors. See: http://dev.mysql.com/doc/refman/5.1/en/show-warnings.html

However, the stupid thing is that you still may not get the relevant information (Which record was wrong? Which field had wrong values?) from this when the INSERT eventually worked. What we really would need is something like a bad record log.
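At minimum, the info string quoted further above can be parsed and evaluated programmatically after each load. A sketch (Python for illustration; the string format is the one quoted from the documentation):

```python
import re

def parse_load_info(info: str) -> dict:
    """Parse MySQL's LOAD DATA result string, e.g.
    'Records: 1 Deleted: 0 Skipped: 0 Warnings: 0'."""
    return {key.lower(): int(val)
            for key, val in re.findall(r"(\w+): (\d+)", info)}

def load_was_clean(info: str) -> bool:
    """Treat any skipped line or warning as a reason to investigate."""
    counts = parse_load_info(info)
    return counts.get("skipped", 0) == 0 and counts.get("warnings", 0) == 0
```

This only tells you that something went wrong, not which lines were affected – which is exactly the missing bad record log lamented above.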

So, right now it seems to me that the impressive reduction of load times when using “LOAD DATA INFILE …” also has its drawbacks and disadvantages.

Therefore, I am thinking about precursor programs that would open the file and analyze the contents of each line before “LOAD DATA” is used on the whole file. The overhead of such precursor programs must of course be evaluated by tests. When I have gained more experience with “LOAD DATA” I will come back to this point in this blog.

Conclusion

Using an SQL statement with “LOAD DATA INFILE …” will reduce the time to import flat csv data into a corresponding database table significantly – by factors of around 18 in comparison to standard methods based on one INSERT per file line. The official MySQL documentation promises a bit more – but the factor measured for a real world example is still impressive. A disadvantage of the “LOAD DATA …” statement from my point of view is that the handling of errors in a PHP program remains unclear.

Nevertheless, for the time being I still would recommend the usage of “LOAD DATA INFILE …” in PHP programs for uploading csv files – if the performance of a csv import is a dominant objective in your project.

MariaDB, LibreOffice 4.x, connectors – performance problems due to primary keys

A few days ago I tested – for a project – several variants of direct access from office applications (MS Office, LibreOffice under Windows and Linux) to MariaDB (MySQL) databases on a remote Linux server. For LibreOffice [LO] I looked at different connectors for the database access – JDBC, ODBC and the direct, native access via a MySQL connector. The latter is shipped with the LibreOffice RPMs under Opensuse Linux. Under Windows the connector is available as an LO “oxt” extension. See:
http://extensions.libreoffice.org/extension-center/mysql-native-connector-for-libreoffice-4.x/releases/1.0

During the performance tests I then had a rather special experience which may well interest others, too.

Test prerequisites

I looked at LibreOffice under Linux in version 3.6 and in version 4.0.3 from the current LO Stable repository of the Opensuse (12.3) Build Service; under Windows in versions 3.6.7 and 4.0.6. The clients (Opensuse 12.3, and Windows 7 under VMware) ran on one and the same system with an i7 Quad 950 processor and a Raid 10 array. The server is a V-server hosted by Strato with a 100 Mbit connection to the Internet, running a 32 bit Opensuse 12.3. As RDBMS, MariaDB was installed, which under Opensuse has succeeded Oracle's MySQL RDBMS.

The performance of the server system is sufficient – we run simulation calculations for industrial process and supply chain networks there, with several tens of thousands of nodes and several tens of thousands of database transactions within a few seconds. For the modest transfer of table entries in the range of 100 KByte up to a few MByte to a client, the server connection is completely adequate.

All tests were performed over SSH tunnel connections to the server. How to secure such connections is explained in detail in a previous article
https://linux-blog.anracom.com/2013/11/22/ssh-getunnelter-datenbankzugang-fur-gehostete-lamp-server-i/
Under Windows these connections were realized with PuTTY.

The tables investigated had between 200 and 220,000 entries. All of them were MyISAM tables, not InnoDB tables.

Das Laden einer Datenbanktabelle erfolgt unter LibreOffice so, dass man über BASE Datenbankverbindungen zum RDBMS eines Remote-Servers definiert. Unter CALC läst man sich dann die definierten Datenquellen anzeigen (Taste F4), wählt eine Bank und eine zugehörige Tabelle aus. Deren Werte werden dann in einem besonderen oberen Bereich des CALC-Fensters zunächst als Base-Tabelle dargestellt, die man auch zum Ändern der DB-Werte verwenden kann. Der CALC-Schirm teil sich durch F4 also in einen oberen Bereich, der an Base angelehnt ist und einen unteren Bereich, der eine normale CALC-Spreadsheet-Tabelle enthält.

Der Import der Daten der DB-Tabelle erfolgt dann durch Drag und Drop der Tabelle aus der mittels F4 angezeigten Base-Tabellen-Liste in den darunter liegenden CALC-Spreadsheet-Bereich. Alternativ wählt man in der Base-Tabellen-Anzeige alle Werte aus, wählt ferner ein Zelle im CALC-Tabellen-Bereich als Startpunkt des Imports aus und betätigt dann eine Taste für den Import.

A range associated with the database is then created in the CALC spreadsheet, which after some waiting time is filled with the data of the RDBMS table. This database-linked range can later be updated at any time with the current data of the RDBMS system. Details can be found in the LO help. I will not go further into the CALC connection to a remote RDBMS in this article.

First tests: considerable performance differences compared to MS Excel??

The shock came from a finding that at first I simply did not want to believe. After quite a few hours, a lot of configuration work and many timed tests, I had the following result in front of me:

Merely loading large database tables via an ODBC driver seemed to work much, much faster under Excel than under the combination LibreOffice/Base/Calc with a direct SQL connector.

As said, I am talking about tables with 10,000 to 220,000 entries. [For the customer's purposes that is not particularly much.] And "faster" is to be understood in the sense that the combination Excel 2010/ODBC under Windows 7 (64-bit) initially ran considerably faster in the tests, i.e. by factors, than any connector variant (ODBC, direct MySQL connector, JDBC) with LibreOffice/Calc/Base.

Note that on my machine Win 7 runs only under VMware. The Linux tests were performed on the same native Linux host that also hosts VMware.

Here are a few numbers for tables with about 14 numeric columns (Integer, Double):

  • Loading/importing a table with 224 records: approx. 1 sec.
  • Loading/importing a table with 6,000 records: approx. 8 sec.
  • Loading/importing a table with 60,000 records: approx. 95 sec.
  • Loading/importing a table with 138,000 records: approx. 4:40 min.

Updates of these tables imported into CALC via Base took about as long. An attempt to load a table with 250,000 records I aborted after waiting more than 10 minutes. I explicitly note that these times cannot be explained by the amount of data transferred over a possibly poor Internet connection. Furthermore:

In contrast stood the figure for loading a table with 138,000 records into Excel via an ODBC connector: only about 6-8 seconds.

I really found this extreme discrepancy with the Linux side unacceptable.

It must be said that the numbers given for the Linux system already represent optimal values. Typically, very strange effects also occurred when switching LO versions between tests by reinstalling the RPMs without completely deleting the configuration files of the current Linux user. In some cases loading even small tables under Calc then took indefinitely long. I attribute these instabilities to up- or downgrades of the RPMs, even when all dependencies were resolved. Perhaps this is an indication that some leftovers and configuration settings still cause interference after a reinstallation.

The values given above were only achieved if the respective configuration directories in the Linux user's home directory had been completely deleted after an up- or downgrade. Furthermore, I could only obtain them with the 4.0.x version. The times under the 3.6 version were in part considerably longer still.

Poor performance values under Windows as well

This really frustrating finding for LO's database connection to a MySQL database was unfortunately fully confirmed by subsequent tests with the LO versions 3.6 and 4.1.3 under Windows.

As a caveat, the philosophy of what a load/import of database data has to accomplish is presumably somewhat different in Excel than in LO. But both systems anchor database connections: in BASE this happens within the application, while under Windows the ODBC connections are stored in the operating system via the control panel.

At first glance Excel loads and imports rather passively; on the LO side, by contrast, changes to the DB contents and subsequent updates from the database are possible at any time for special database-associated "ranges" of Calc tables. Nevertheless:

Already when working in LO-BASE the speed differences are striking: while scrolling through large tables, BASE continuously refreshes the displayed data in the background from the connected database. I found no setting to prevent this. Perhaps I was simply too dense to find one.

Under Windows it also turned out that an ODBC connector delivered almost the same times for the database import into CALC as the direct MySQL connection through the MySQL connector. Unfortunately, the performance of a JDBC connection proved to be considerably worse still than that of the MySQL and ODBC connectors. That at least the ODBC connector can actually accomplish much more is proven precisely by the excellent loading times for Excel.

A performance problem of the MySQL connector due to primary keys?!

The frustration spurred me on to further experiments. Quite by chance, during further tests under Linux, I came across a table with 78,000 entries. Surprisingly, this table behaved differently: instead of having to wait minutes, I could load the table data into a Calc spreadsheet via BASE/CALC in about 4-5 seconds.

Further analysis then revealed that I had forgotten to define an explicit primary key for this table. This led to the suspicion that explicitly defined primary keys might be partly responsible for the poor performance of the LibreOffice connection, as idiotic as that may sound ...

Solution: replace primary keys with unique keys!

Of course I then, as a test, dropped the explicitly defined PRIMARY keys in other tables as well and replaced them with plain UNIQUE keys. Yes, that works: according to the documentation, MySQL and MariaDB silently take the first suitable UNIQUE key and use it as the primary key. So far I have not experienced any major problems in this regard.
The result of my experiments was:

For all tables in which the explicitly set primary key was replaced by a pure unique key, the data import into LibreOffice CALC completed in acceptable times that differed only slightly from those of Excel (factor < 1.25).
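The replacement described above can be sketched roughly as follows; the table and column names are purely illustrative, not the actual project schema:

```sql
-- Hypothetical table with a three-part combined key (names are examples).
-- Drop the explicitly defined primary key ...
ALTER TABLE measurements DROP PRIMARY KEY;

-- ... and replace it with a plain unique key over the same columns.
ALTER TABLE measurements
  ADD UNIQUE KEY uk_measurements (pos_x, pos_y, t);
```

According to the documentation, the server then silently treats the first suitable UNIQUE key much like a primary key, so uniqueness of the records remains guaranteed.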

Simultaneous observation of the database and the network connection gave the well-founded impression that, after the actual loading of the data from the RDBMS, CALC still spends considerable time on the internal construction of the CALC table. For the tables freed from the primary index, only this latter step, and no longer the loading of the data from the database itself, proved responsible for the remaining time differences compared to Excel when filling the local spreadsheet tables.

Concretely, loading the table with 138,000 records now took just under 8 seconds instead of the earlier almost 5 minutes.

In our project I have now replaced the explicitly defined primary keys with unique keys in all RDBMS tables. Since then, the users have been able to work very well on the MariaDB database with CALC too.

Open question: why do primary keys hamper LibreOffice performance so drastically?

Unfortunately, I have no plausible answer. This question will have to be clarified by the MariaDB developers. What is striking is the persistently high transaction rate on the server, more than 200 transactions/sec in total, whenever a primary key is present.
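One way to observe such a rate on the server side yourself is via the standard MySQL/MariaDB status counters; this is a sketch, not the measurement method I actually used:

```sql
-- Total number of statements the server has executed since startup.
SHOW GLOBAL STATUS LIKE 'Questions';
-- Run the statement again after a few seconds and divide the
-- difference of the two values by the interval to get a rate.

-- Per-statement counters can narrow down what the client is doing:
SHOW GLOBAL STATUS LIKE 'Com_select';
```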

Finally, I should stress that I have explicitly tested the positive finding for removing the primary keys only against a MariaDB. It would certainly be interesting to investigate the performance behavior under a native Oracle MySQL database as well. Unfortunately, I have not yet found enough time for that.