PHP: Upload/import of large csv files, server resource limitations, php.ini

In this article I want to list and briefly discuss some resource limiting parameters for PHP that may become important when big data files are transferred to a server and afterwards imported into a database. Before I do that, I first want to discuss a related comment of reader.

Data splitting to avoid huge transfer times to the server

In one of my last articles [Importing large csv files with PHP into a MySQL MyISAM table] I recommended using the SQL statement “LOAD DATA INFILE …” inside PHP programs to import big csv-files into a MySQL database. In addition I recommended to deliver all fields of a record with the same key dependency in one file instead of distributing data fields of records over several separate csv-files and import these files in maybe several separate (denormalized) database tables.

A reader reacted by writing in a mail that a file combining many data fields would of course become much larger than each of the individual files with only few data fields. He pointed out that this may become a problem for the transfer of the file from a client to the server – especially if the server was hosted at a provider. Typical and connected problem areas could be :

  • bandwidth of the client-server connection – especially for ADSL connections
  • limitations on the server side regarding e.g. file size, maximum job duration or input times

I agree – but only partially.
First, it is true that a limited bandwidth for file uploads may become a problem. An example of a “small” csv file with a million records each containing 7 fields (4 key fields, 3 quantity fields) makes that clear:

Upload time = transfer time to the server :

  • In my case the file size was around 35 MByte. To upload such a file with ADSL and a maximum upload rate of 1MBit gives you an effective 5 minutes upload time (better transfer time to the server).
  • The transfer time has to be compared with the data import time on the server itself which turns out to be in the region of 6 seconds when using the “LOAD DATA INFILE” directive.

The time difference will get even bigger for larger files. So, we see that the transfer time may soon become a dominant factor when our server is located somewhere in the internet. The question arises, whether required upload times may collide with server settings. This would be one example of a server side resource limitation we need to deal with when working with big data. We come back to a potentially relevant, but disputed parameter later on.

Anyway, I do not agree with data splitting by fields to overcome bandwidth problems:
One reason is that the total upload time will not become smaller. The only advantage is a smaller upload time interval per file. This may help to get a better control over the upload process(es). My real argument against splitting by fields is that the total data import time for several files with a few data fields each but with the same huge number of records may become considerably bigger than for one file with all data fields (and the same number of records). At least if you use the fast “LOAD DATA INFILE” feature of the MySQL engine. So, I would

  1. either tolerate a relative big transfer time to the server
  2. or try to organize the data records in such a way that they can be uploaded to the server and imported into the data base sequentially – i.e. by files with the full data field spectrum but with reduced record numbers.

I would not give up the idea of transferring csv-files with as many fields as possible per record line – as long as this is
compatible with a normalized data model. Point 2 can be realized very often without problems – and with Ajax and HTML5 technologies it can even be solved in such a way that the transfers are automatically done one after the other (file pipelines). So, if you want to limit the transfer times, split the data by records and not by fields. I.e. transfer several files with bunches of records instead of several files with fewer fields.

Resource limitations require control over server parameter settings

Nevertheless, relatively big transfer times for big files may conflict with some server settings. Also – and more important – the time required for the import into the database or the file size and memory consumption can conflict with limits set on your server. E.g., the maximum time a PHP job is allowed to run on most web servers is limited – as are many other resources given to a PHP job.

Many of the PHP resource limitations we may be confronted with on a Apache Web server are defined by parameter settings in a php.ini-file. On a SuSE system this is typically located at “/etc/php5/apache2/php.ini”. Many providers deny access to these global settings – although one could in principle influence them for certain scripts by .htaccess-files or by putting php.ini files into the directories where your PHP-scripts reside.

This is one of the reasons why I urge my customers to rent a root server or at least a virtualized server of some provider when seriously dealing with big data. You need to have control over a variety of server settings. Sometimes also general settings and not only php.ini-parameters. Working with big data files and applications dealing with tens or hundreds of millions of records in joined tables require full server control especially during the development phase. This is in my opinion nothing for simple web site hosting.

Some relevant php.ini parameters

What are typical php.ini parameters that limit resources for PHP processes on a server and may get relevant for file uploads? I think the most important ones are the following:

  • max_execution_time :: default value : 30 (many providers limit this to 10)
  • upload_max_filesize :: default: 2 M
  • post_max_size :: default : 8M
  • memory_limit :: default :: 128 M
  • max_input_time :: default value : 60
  • session.gc_maxlifetime :: default : 1440

The time numbers are given in secs. The memory numbers in Megabytes.

Most of these parameters are mentioned in the following web articles which comment their impact on file uploads. Please, have a look at these web pages.
http://php.net/manual/en/features.file-upload.php
http://php.net/manual/de/features.file-upload.common-pitfalls.php

Important parameters for the transfer of big data files are “upload_max_filesize” and “post_max-size”. I want to stress the following point:

Both of these parameters have to be set consistently when dealing with file uploads. (Actually, in the past I sometimes forgot that myself and wasted some time with wondering why a file did not get loaded although I had set a sufficient value for upload_max_filesize).
 
When you upload a file plus some additional POST data the total amount of POST data can be bigger than just the file size. However, “if the size of post data is greater than post_max_size, the $_POST and $_FILES superglobals are empty” – according to the PHP manual. You may even get no warnings or errors in some situations.
Therefore, the value of “post_max_size” should always be bigger than the value of “upload_max_filesize” – and the latter should of course be at least as big or a bit bigger than the size of the file you plan to transfer to a server.

An important parameter is also the “max_execution_time”. It limits the time a PHP process is allowed to run. It should be big enough to cover the file handling and database import times on the server. Transfer times are not included if we believe the hints given at http://php.net/manual/de/features.file-upload.common-pitfalls.php.

In my understanding the “max_input_time” parameter limits the time for “parsing” request data (= POST or GET data). See http://php.net/manual/en/info.configuration.php#ini.max-input-time
However, a comment on the following page
http://php.net/manual/de/features.file-upload.common-pitfalls.php
says something different:

Warning – max_input_time sets the maximum time, in seconds, the script is allowed to receive input; this includes file uploads. For large or multiple files, or users on slower connections, the default of 60 seconds may be exceeded.”

Personally, I have some doubts about this last statement as I never experienced any problem with the standard settings and files above 200 MB. So a contrary position would be:

“Parsing” refers to the time interval between the arrival on the web server and before a PHP script starts executing – so “max_input_time” would not include the upload time (= transfer time to the server). However, it would include the time to prepare the superglobal arrays as $_GET, $_POST, $_FILE.

This interpretation makes some sense to me. However, I did not test it thoroughly.

Interestingly enough, there is some dispute about the meaning of the max_input_time parameter and its relation to the max_execution_time parameter on the internet. See:

http://blog.hqcodeshop.fi/archives/185-PHP-large-file-uploads.html
https://bugs.php.net/bug.php?id=53590&
https://bugs.php.net/bug.php?id=28572
http://stackoverflow.com/questions/11387113/php-file-upload-affected-or-not-by-max-input-time
http://www.php.de/php-fortgeschrittene/41473-php-ini-frage-zu-max_input_time.html
https://www.grumpyland.com/blog/101/settings-for-uploading-large-files-in-php/
http://www.devside.net/wamp-server/apache-and-php-limits-and-timeouts
http://www.techrepublic.com/article/a-tour-of-the-phpini-configuration-file-part-2/
http://serverfault.com/questions/224536/is-max-input-time-included-in-max-execution-time

I have no definite opinion about this discussion, yet. In two of the articles listed above one can see that even the error messages referring to conflicts with the max_input_time parameter can be misleading as they wrongly refer
to the max_input_time. Therefore, I recommend be aware of the parameter “max_input_time”, although the default should be sufficient for most cases. Setting it to “-1” corresponds to an unlimited max_input_time interval – whatever it really has an impact on.

If you really work with huge files of several hundred megabytes even the garbage collection time may become important. gc_maxlifetime sets a life time limit for data until the system regards them as garbage. So, if your import or data handling times get big enough also this parameter must be adjusted.

The “memory_limit” of a PHP-Scripts may also be reached when processing huge files.

So, you see there is a whole bunch of parameters which you may need to adjust when you start working with larger and larger files.

Web hosting providers and php.ini-settings

You find typical settings of php.ini-parameters for the web hosting packages of the providers Strato and 1&1 here.
http://strato-faq.de/article/1315/Mit-welchen-Grundeinstellungen-steht-PHP-bei-STRATO-zur-Verf%C3%BCgung.html
https://hilfe-center.1und1.de/skript–und-programmiersprachen-c82634/php-c82686/bedienung-c82739/welche-php-einstellungen-kann-ich-aendern-a791899.html

Most web providers also give more or less clear information about which of the PHP relevant parameters can be manipulated by customers (e.g. by the use of .htaccess- or directory specific php.ini-files) and which not. See e.g.:
https://hilfe-center.1und1.de/skript–und-programmiersprachen-c82634/php-c82686/bedienung-c82739/php-einstellungen-ueber-phpini-aendern-a791901.html

At some providers you can use a .htaccess file, for others you may need to put a php.ini file into each script directory. See e.g.:
http://www.webdecker.de/artikel/items/php-ini-value-htaccess.html

Posted in PHP

MySQL/Aggregation: Comparing COUNT(DISTINCT) values on big tables

Sometimes one needs to work with rather big database tables and with COUNT(DISTINCT) statements. Recently, I had to analyze the result sets of COUNT(DISTINCT) statements for table consistency checks. These tasks required subselects or subqueries. The SQL statements were performed on a test table with more than 5 million records. The data records – together with data of another table – formed the data basis of several complicated computational runs. The eventual record number in our application will be much larger, but some optimization principles can be learned already from 5 million example.

Most of the computed COUNT(DISTINCT) values referred to some of the table’s key columns. Our data records have a combination of three different key fields, which together define a unique key. The combined unique key has a sub-structure corresponding to physical aspects of the real world data objects behind the records. This induces a natural grouping of records.

Such a data structure occurs very often in physical simulations: Think e.g. of a grid of meteorological observation stations that measure environment data at distinct points of a time line. Each station would then be characterized by 2 positional data. The measured quantity values would have a three fold unique key : two geometrical position data and a time value. Then the data for temperature, pressure over time in a database table could be aggregated and grouped according to geometrical positions, i.e. for the stations and their locations.

In my case COUNT(DISTINCT) values had to be computed over one key column and for groups defined by distinct combinations of other keys. The results should be compared with a reference value: the COUNT (DISTINCT) result for the whole table. I shall explain the detailed key structure and the comparison objectives below. Some of the counted distinct values also had to be compared with similar values of another, but much smaller table that contained only around 14000 records for a reduced set of the key columns. The comparisons were part of consistency checks concerning record groups inside the huge table and concerning the equality of distinct key values used in the huge table and the smaller table.

When working on these tasks I first tried to optimize the required sub-select statements for the COUNT(DISTINCT) evaluation on my key columns by defining and using proper indices. After some more analysis, I extended my optimization strategy to creating a helper table instead of working with intermediate but volatile “derived result sets” of my sub-selects. Read below, why …

My case is a little special, but consistency checks regarding the number of records in certain record groups which correspond to distinguished combinations of some keys occur in very many environments (physical simulation, statistics, neuronal networks, ..). Therefore, I hope that the insights I got from my simple scenario may be helpful for other people who have to deal with COUNT(DISTINCT) statements on big tables, too.

The playground, data key structure and some numbers

In our project we load the data of a sequence of csv-files (each with millions of records) into 2 tables of a database. After each import process we check some consistency conditions the records of two tables must fulfill. The reason is simply that we want to avoid interrupted or crashing computation runs due to a lack of data or inconsistent data provided by our tables.

Each record in our big table contains data values that depend on a 3 fold key-structure:

Two keys [m,n] describe certain objects by 2 dimensions in an abstract space. Let us call such an object a “node”. For a “node” some hundred values of several quantities are defined with respect to an additional integer key “[i]” of a further dimension.

The table structure
of our first table “TA” is flat – something like

nr,    m, n, i    Q1, Q2, Q3, ….

with the Qs representing some quantities and the first column being a auto-indexed one. The number distribution of the key values in our case is as follows:

m has only some few distinct values (2 to 10), whereas n may have distinct values in a range between 10000 up to 100000. Distinct i values are in the range between 200 and 1000. In my concrete test table I chose:

  • distinct m values : 1 (as an extreme situation)
  • distinct n values : 13580
  • distinct i values : 386

There is another table “TB” with keys [m,n] and associated quantity values per object. The data of both tables are later on used to perform several calculations. For a sufficient performance of the computational runs we need at least a combined index over the tupel [m,n] – which should be created already during the data import phase. The funny thing in our scenario is that the time required for the data load and import phase is dominant in comparison with all of our computation runs – right now by a factor of 1,5 to 3. So, we really have no intention of making the import time interval any bigger without very good reasons.

Objectives of COUNT and COUNT (DISTINCT ) statements : consistency checks

The consistency of the imported data in the two tables TA and TB must be guaranteed. Some of the consistency conditions that must be fulfilled are:

  1. The number of distinct values for both [m], [n] and [m,n] must be the same in the tables TA and TB
  2. The number of unique [m,n,i] values should be equal to the total number of entries in table TA.
  3. The number of distinct [i] values (i.e. number of distinct records) for any given [m,n]-key pair (i.e. node) should be equal to the total number of records for that [m,n]-pair.
  4. The number of distinct i values (i.e. number of records) of a given [m,n]-key pair in table TA should have an identical [m,n]-independent value. (For all nodes there must be an identical number of i-dependent records).
  5. The number of distinct i-key values of any [m,n] pair should be identical to the number of distinct i-values given for the full table

The last 3 conditions together (!) guarantee that for each [m,n] combination (=node) the different i values of the associated records

  • are all distinct – with the same sequence of values independent of [m,n];
  • lead to the same number of records for every [m,n]-combination.

The following aspects should be considered:

  • The checks are to be performed several times as the big table would be generated stepwise by a sequence of “LOAD DATA INFILE” processes. E.g. 4 times a 5 million record file. A consistency check should in addition always be performed before a computation run is started.
  • The time required for consistency checks add up to the total data import and preparation time for the tables. The effect should be limited to a tolerable factor below 1.3.
  • We expect indices over the key columns to play an important role for the performance of SELECTS for aggregated distinct key values. Building indices, however, may cost substantial additional time during the data import and preparation phase.
  • The performance of certain SELECT statements may depend on the order of columns used in the index definition.

As the data import and preparation time was already dominant we were eager to avoid any intolerable prolongation of this time period.

The impact
of a unique or non unique index over all key columns on the data import and preparation time

In our case we would create a [m,n] or [n,m] index. Such an index was unavoidable due to performance requirements for the computation runs. An [n,m]-index would give us a slight performance advantage during our calculations compared to an [m,n] index. But this difference is marginal. In the data loading phase we import our data into the MySQL base with “LOAD DATA INFILE” statement issued by a PHP program (see ). The index is filled in parallel. We were interested in whether we could afford a full unique index over all 3 key tables.

Below we give some numbers for the required time intervals of our data imports. These numbers always include a common overhead part. This overhead time is due to an Ajax exchange between browser and server, zip file transfer time across our network, PHP file loading times, zip expansion time on the server, movement of files … So the numbers given below do not reflect the pure data loading time into the base on the server. By estimation this time is at least by 4,5 seconds smaller. The test server itself is a relatively small one in form of a KVM virtualized LAMP server with 2 GByte assigned RAM, CPU INTEL Q9550, Raid 10 disk array. For the following tests I worked with MySQL MyISAM tables – however, the general picture should hold for InnoDB tables, too.

Addendum, 21.09.2014:
Please note in addition that our big table has a first column which is auto-incremented and is used as a primary index. This primary index was always created. It explains why the absolute numbers of data import times given below are relatively big. See a forthcoming article
MySQL: LOAD DATA INFILE, csv-files and index creation for big data tables
for a more complete discussion of the impact of (unique and non-unique) indices on data imports from csv files.

We got the following loading times required to fill table TA with data from a csv-file with m=1, n=13580, i=386 (all distinct) by using “LOAD DATA INFILE”:

  • without [n,m,i]-index creation : 23.5 sec.
  • with [n,m]-index creation (non unique) : 25.3 sec
  • with [n,m,i]-index creation (non unique) : 26.3 sec
  • with [n,m,i]-index creation (unique) : 36.2 sec

(compare with the following blog articles
MySQL/PHP: LOAD DATA – import of large csv files – linearity with record number and
Importing large csv files with PHP into a MySQL MyISAM table )

Note that there is no big difference between creating a full [n,m,i]-index over all key columns and creating an index only for the first two columns of our test table. So a full [n,m,i]- index creation seemed to be affordable. It would not hamper the computation runs, but it helps a little with our consistency checks.

However, there is a substantial difference of loading times for a unique and a non unique index. This difference stems of course from related checks which have to be performed during the implicit INSERTs of the data into the table. As we do not really need a unique index for the computation runs the question arises whether we can just use a plain index and still can verify check condition 2 with a better performance than using 10 extra seconds. The answer is – yes we can do that. See below.

Another remark for reasons of completeness:

An [i,n,m]-index instead of a [n,m,i]-index would double the time required for our computation runs. The branching structure of an index tree is performance relevant! in our scenario this is due to the fact that aggregations (sums, statistical information, … ) over the i-dependent quantities for a node play an important role in the calculation steps. Selects or aggregation for distinct [n,m]-nodes and aggregation processes over [n,m]-related record groups require a [n,m,..] index. In case of doubt about the n,mm-order I would say it is better to start with the more numerous branching. Any initial branch chosen then leads to a smaller amount of remaining records for further analysis. Almost all statements discussed below profit from the chosen column and branching order of the index.

What information is really required?

Looking at condition 2 we would say: Let us get the number of all distinct [n,m,i] triples in the table and compare that with the COUNT(*) value for the table. With our given [n,m,i]-index, we could use the following SQL statement:

“SELECT COUNT(DISTINCT n,m,i) FROM TA” :: 1.9 to 2.0 secs .

If we had a unique index the answer would come much faster. However, with our plain [n,m,i]-index such a statement takes 2 secs – which is still considerably less than 10 secs (< 10 secs!). Our [n,m,i]-index is used already for grouping and sorting as EXPLAIN will tell you. So, most of the time is spend for comparing i values of the sorted groups. If there were no other conditions we would have to use this statement. However, conditions 3 to 5 are very strict. Actually, some thinking shows that if we guarantee conditions 3 to 5 we also have guaranteed that the total number of records is identical to the number of distinct records. If for some given [m,n] combination we have y distinct records than we have y different i-values. The rest follows ... Therefore, we should instead concentrate on getting information about the following numbers:

  1. The total number of distinct i values in the table. This would be used as a reference value “$ref_dist_i”.
  2. The total number of i-values ( = number of records ) “num_i” per [m,n]-node (for all nodes).
  3. The total number of distinct i-values “num_dist_i” per [m,n]-node (for all nodes).
  4. The number of records, for which num_dist_i or num_i deviate from $ref_dist_i.

If we find no deviation than we have also confirmed condition 2.

“Alternative”
There is another way to prove condition 2 if there is an additional precondition:

The sequence of i-values given for each node shall be the same for all nodes.

This condition is given in very many circumstances (e.g. in physics simulations) – also in our case. Then we could look at the difference of

Max(Distinct i) – Min (DISTINCT i) for each node

and demand

  • [Max(Distinct i) – Min (DISTINCT i) + 1] == [COUNT (DISTINCT i)] == [COUNT i] per node
  • plus that [COUNT(i)] has the same value for all nodes.

Please note that we would not need to determine the number of distinct i values in our big table at all for this variant of our consistency check.

We shall see below why this can give us a substantial advantage in terms of SELECT times.

To investigate condition 1 we need additionally:

  • The total number of distinct [n,m]-values, distinct m-values and distinct n-values for the tables TA and TB.

Aggregated distinct key values for groups corresponding to leading branches of an index

Let us now consider
SQL statements that determine distinct [n,m]-values, distinct m-values and distinct n-values of table TA. We make the following trivial assuption: COUNT (DISTINCT ) aggregations over one or several columns are much faster if you can use an index defined for these columns. Why ? Because an index has a tree structure and “knows” about the number of its branches which are determined by unique values! Furthermore, also on subsequent branching levels it can be used for grouping and sorting of records ahead of internal value comparisons.

If this is true we would further assume that getting the number of distinct [n,m] values should be pretty fast, because our [n,m,i]- index can be used. And really :

“SELECT COUNT(DISTINCT n,m) FROM TA” :: 0.051 to 0.067 secs

The same holds for “n” alone:

“SELECT COUNT(DISTINCT n) FROM TA” :: 0.049 to 0.067 secs

The marginal difference is not surprising as there is only one [m]-value. EXPLAIN shows that in both statements above the index is used for internal grouping and sorting.

However :

“SELECT COUNT(DISTINCT m) FROM TA” :: 1.4 to 1.6 secs

What we see here is that the index may still be used for grouping – but for each “n”-branch the “m”-index values still have to be compared. Can we make this evaluation a bit faster? What about a sub-select that creates a [n,m]-basis for further analysis ? This could indeed lead to a better performance as ordered [n,m]-pairs correspond directly to leading branches of our index and because the result set of the subquery will be significantly smaller than the original table:

“SELECT COUNT(DISTINCT a.ma) FROM (SELECT DISTINCT n, m AS ma FROM TA ORDER BY n) as a” :: 0,056 to 0,078 secs

Compared to a more realistic (n,m)-value distribution this result is a bit too positive as we only have one distinct m-value. Nevertheless, we see that we can use our already created [(n,m),i]-index with some considerable advantage to work on parts of condition 1 in the list above. Note, that a temporary table for the derived result set of the sub-select is created; EXPLAIN and time measurements of the MySQL profiler will show that. This intermediate table is substantially smaller than the original table and also the number of distinct [m]-values is small (here 1).

What about table TB and these numbers ? Well, TB is just as small as our intermediate table of the last statement! And because it is (relatively) small, we can afford to create whatever index required there – one for [n], one for [m], one for [n,m]. So, we can get all required node information there with supreme speed.

Summary: We can perform consistency checks for condition 1 without any performance problems. Our [n,m]-index is suitable and sufficient for it.

Two additional lessons were learned:
1) Aggregation for key groups and leading index branches over corresponding key columns should fit to each other.
2) Intermediate temporary tables may help for a required aggregation – if they are much smaller than the original table. This leads to SQL statements with subselects/subqueries of the form

SELECT COUNT(DISTINCT a.x) FROM ( SELECT DISTINCT col1 AS x, col2, .. as FROM ….ORDER BY .. ) as a

Aggregating distinct values of a column which is the last in the column definition order of a combined index

Now, let us turn to the determination of the total number of distinct i values. We need this value as a reference value ($ref_dist_i) if we want to verify conditions 3 to 5 ! The fastest way – if we only have our plain, non-unique [n,m,i]-index to get this number – is a simple

SELECT COUNT(DISTINCT i) FROM table TA : 1.85 – 1,95 secs

nAn EXPLAIN shows that this statement already uses the [n,m,i]-index. I did not find any way to make the process faster with just our plain (non unique) [n,m,i]-index. And note: No intermediate table would help us here – and intermediate table would only be a reordered copy of our original table. Therefore:

“SELECT COUNT(DISTINCT a.ia) FROM (SELECT DISTINCT i AS ia, n, m FROM TA ORDER BY n, m ) as a” :: 11.9 secs

To try to make the determination of COUNT(DISTINCT i) faster we could be tempted to create another index which starts it’s tree branching with the i-column or just a plain index for the i-column. We would build such an index during the data load process. And then we would find in our scenario that the extra time to create such an index would be ca. 2 sec. We would first have to invest time in a proper data analysis – even if a subsequent SELECT COUNT(DISTINCT i) FROM table TA would be pretty fast.

So, if we do not get the number of distinct i-values from elsewhere (e.g. from the provider of our csv-files) – we are stuck to something like an additional overhead of 2 secs – which gives us a choice: Create an additional index OR use the simple SQL statement ?

Note that even the creation of a pure i-index would consume considerable memory space on the hard disk (in my concrete example the required index space grew from 138 MB to 186 MB – this total value being not far from the total amount of data in table TA.) Because, up to now I have not found another usage for an i-index I refrain from creating it. I better save the determined number of distinct i values in a tiny additional parameter table to make it accessible for different PHP programs afterwards.

Comparing aggregated COUNT(DISTINCT) values for record groups with a given reference value

Conditions 3 to 5 of our task list require that we compare the reference value [$ref_val_i] for distinct i values of table TA with both the distinct i values and the sum of all i values for each of all the [m,n]-nodes. This leads us to something like the following SQL statement with a subselect/subquery:

Comparison Statement:

SELECT count(*) FROM ( SELECT COUNT(distinct b.i) AS sum_distinct_i, COUNT(b.i) AS sum_i FROM TA b GROUP BY b.n, b.m HAVING sum_distinct_i != $ref_dist_i OR sum_i != $ref_dist_i ) AS a

It gives us the number of nodes with deviations either of the number of associated distinct i-values or of the number of associated i-values from the reference value (here $ref_dist_i). The complicated statement requires some time.

Comparison statement :: < 2.2 secs

Despite the fact that our [n,m,i]-index is used as an EXPLAIN will show. You won’t get that much faster. The alternative

SELECT count(*) FROM ( SELECT COUNT(distinct b.i) AS sum_distinct_i, COUNT(b.i) AS sum_i FROM TA b GROUP BY b.n, b.m ) AS a WHERE a.sum_distinct_i != $ref_dist_i OR a.sum_i != $ref_dist_i

has the same performance. The slow part is the WHERE or HAVING analysis on the derived temporary result set.

By the way:
Working with an [n,m]-index and a separate [i]-index would slow down the comparison statement to > 2.6 secs. So, here we eventually find a slight advantage of our combined index over three columns (also the computation runs get a bit faster). In addition the combined index takes less disk space than several individual indices.

We now should take into account the following aspect of our scenario: We may call such a comparison statement several times if we load a sequence of data files – each with 5 million data records. Then our “comparison statement” will get slower and slower the more records the table comprises. In addition we may also start every computational run with a consistency check of the tables
and show the results in a browser! Then we need to optimize the response time for the web page. Even 2.2 secs may then be too much. How can we meet this challenge ?

Helper table instead of derived result sets from subselects/subqueries

My answer was to use a helper table “TC” – which receives the data of the derived result set from the sub-select of our comparison statement. In table TC we would store something like

m,    n,    sum_distinct_i,    sum_i

and maybe other node dependent values. This has also the advantage to become flexible in case that some day the distinct i values get [n,m]-dependent and have to be compared to some other values.

Note that the number of values in such a table is considerably smaller than in the original table. So, speaking in relative numbers we could generate some indices for [m,n], m, n, sum_dist_i, sum_i without much investment into cpu time and memory! This would also relieve us from complicated statements to determine the number of distinct m values as discussed above. And statements as

SELECT COUNT(*) FROM TC WHERE sum_distinct != $ref_dist_i OR sum_i != $ref_dest_i

would perform very well. (Under our present conditions this should give a plain zero if consistency is given).

Therefore, the only question remains: What would such a helper table cost regarding time ? The proper statement to fill a prepared table TC with indices is

INSERT INTO TC (m, n, sum_distinct_i, sum_i) SELECT a.m, a.n, COUNT(DISTINCT a.i), COUNT(a.i) FROM TA a GROUP BY a.n, a.m

This takes approximately 2.26 secs. However, an additional

SELECT COUNT(*) as sum_dist FROM TC WHERE sum_distinct_i != $ref_dist_i OR sum_i != $ref_dist_i

takes only 0,002 secs:

  • SELECT COUNT(DISTINCT m) FROM TC :: < 0.002 secs
  • SELECT COUNT(DISTINCT n) FROM TC :: 0.018 to 0.026 secs
  • SELECT COUNT(DISTINCT n,m) FROM TC :: 0.007 to 0.012 secs

This means that we do no longer have to be afraid of repeated consistency checks!

“Alternative”
If our conditions are such that we can follow the alternative discussed above the helper table is of even more help. The reason is that we can omit the determination of our reference value $ref_dist_i. Instead we fill the (modified) helper table with

INSERT INTO TC (m, n, sum_distinct_i, sum_i, min_i, max_i) SELECT a.m, a.n, COUNT(DISTINCT a.i), COUNT(a.i), MIN(DISTINCT i), MAX(DISTINCT i) FROM TA a GROUP BY a.n, a.m

This costs us only slightly more time than our original value. With a proper index generated over the columns sum_distinct_i, sum_i, min_i, max_i we the use

SELECT COUNT(DISTINCT sum_distinct_i, sum_i, min_i, max_i) FROM TC :: < 0.002 sec

In case that everything is OK this should give us a “1” in almost no time. Then additionally we fetch just one line

SELECT (*) FROM TC LIMIT 1

and check the conditions [Max(Distinct i) – Min (DISTINCT i) + 1] == [COUNT (DISTINCT i)] == [COUNT i] for this node, only.
The big advantage is that we win the time otherwise spent to determine the COUNT(DISTINCT i) value for our big table. This advantage gets bigger with a growing table.

Remark regarding indices: As you have realized we clutter the helper table with indices over various columns. Normally, I would characterize this as a bad idea. But in our case I do not care – compared to other things it is a small prize to pay for the relatively small space the numerous indices consume for our (relatively) small helper table.

The resulting sum of additional time investments for preparing the consistency checks

We
find for our 5 million record table that we need to invest some extra time of

  • around 2.2 secs for creating a helper table

in the best case. If we cannot use the “Alternative” we additionally need around 2 secs for getting the number of distinct i-values in our table.

So, this makes 1% in the best case up to 16 % of the original loading time (26.3 secs). I think, this is a good compromise as it gives us a repeatable consistency check for a given status of the tables.

Extrapolation to table enlargements resulting from loading a sequence of further data files

What happens if we deal with larger data amounts sequentially loaded into our big data table from several csv-files? We expect things to behave reasonably well:
A critical point is that the helper table TC should only get new records originating from the new file loaded. Already saved aggregated values should not be changed. This in turn requires that the data in the new files belong to nodes that are distinct from already loaded nodes. So, the data imported must be organized by nodes and systematically spread over the files to be loaded. The relevant data records for the new aggregations can easily be identified by an evaluation of an auto-index column in the big data table before and after the loading process. The aggregation results for the new nodes are than added to the helper table without any interference with the already aggregated values stored there.

Therefore, the time to generate additional entries in our helper table should remain a constant per csv-file to be imported!

However – if we need to do it – the determination of the COUNT(DISTINCT i) would require more and more time for our growing big table.

So, at this point we will really profit significantly from the discussed “Alternative”.

The time to perform our checks on the helper table will also grow – but in comparison to other times this remains a really small contribution.

Summary

In our scenario we had to find a compromise between index generation and a resulting prolongation of data loading and preparation time versus a faster consistency check (partially) based on aggregated COUNT(DISTINCT) values for (grouped) key columns. Such checks typically lead to sub-selects with derived result sets. Even if the temporary tables implicitly used by the RDBMS engine are relatively small compared to the original big tables any further aggregation or analysis on the derived data may cost considerable time. If such analysis is to be done more often than once the creation of a helper table with aggregated data may pay off very quickly.

In addition we have seen that one should carefully evaluate alternatives to verify consistency conditions and what information can and should be used in the verification process. Not all obvious ways are the best in terms of performance.

Addendum, 21.09.2014:
Is this the end of a reasonable line of thinking ? No, it is not, if you want to make your big data imports faster.
Creating a helper table is useful. But there are more points to take into account when trying to optimize:

The reader has probably noticed that the creation of a unique [n,m,i]-index enlarged the loading time drastically. This observation sets a huge question mark behind the fact that I had a primary index on the first auto-incremented “nr”-column. Furthermore, what about the impact of index building in parallel to the processing of “DATA LOAD INFILE” in general? We did not discuss whether we could accelerate the import time by omitting index building throughout the “LOAD DATA” import completely and creating the index instead afterwards! And we assumed that index creation during the import of a sequence of files would behave reasonably – i.e. linearly with the growing number of records. Actually, as further tests show the latter is not
always true and separating index creation from the execution of “LOAD DATA INFILE” is an important optimizing step! See a forthcoming article for a more detailed discussion of these interesting topics!