big data | Linux-Blog – Dr. Mönchmeyer / anracon

In some recent articles in this blog I discussed the following aspects of a data import from csv-files into MySQL MyISAM tables:

The use of the SQL command “LOAD DATA INFILE” may accelerate the import by a factor of at least 18 in comparison to reading the data line by line with fgetcsv() and inserting single records into the database table.
The creation of a (small) helper table may often be useful to improve the performance of repeated consistency checks based on the evaluation of “derived result sets” from COUNT(DISTINCT …) subselects or subqueries on big tables.

The first point implies a straightforward use of “LOAD DATA INFILE” in your PHP programs. The second point applies only if the data have a suitable structure, and it requires a balanced use of indices.

See:
Importing large csv files with PHP into a MySQL MyISAM table
MySQL/PHP: LOAD DATA – import of large csv files – linearity with record number?
MySQL/Aggregation: Comparing COUNT(DISTINCT) values on big tables

In these articles I have more or less neglected two things:

A possible splitting of the import process into several sequential processes. Instead of loading one big file we may rather distribute the data records over several files. Each individual process loads data from just one file; so the data are loaded file after file into the table.
The impact of index creation on the data import time.

There are two reasons for loading data sequentially from multiple csv-files: better control of the loading process, because each well-defined data package takes less time, and an adaptation to memory limits. However, without tests it is not clear whether such an approach has an impact on the overall loading performance.

The second point raises the question whether and how index creation influences the efficiency of “LOAD DATA INFILE”. Note that the MySQL documentation explicitly recommends separating index creation from the loading process. See:
http://dev.mysql.com/doc/refman/5.1/de/load-data.html

This article provides some numbers from test runs for both measures. We shall see how the combination of both points influences the overall performance of the data import. Furthermore, an optimized loading process may influence your strategy regarding consistency checks during a sequential loading of several files.

Loading data from a sequence of multiple files instead of loading from one big file

Let me briefly look at some aspects of loading data from a sequence of files. In the article PHP/MySQL/Linux: File upload, database import and access right problems I have already discussed that one may split the data to be imported into bunches of records and distribute them over several files. I pointed out that the data should not be split by fields. Instead, records should be distributed over multiple files, and each record should contain all of its associated (!) fields if you want to profit from the performance of “LOAD DATA INFILE”.

A further aspect you have to take care of is the logical relations between the data records. Such relations may become relevant if you already want to work with the data in between two file loads:

Let's say you have data with the structure

nr, m, n, i, Q1, Q2, Q3

m, n, i shall represent keys which together form a unique key of each record. An [m,n]-combination may e.g. define an object called “node” and [i] may represent time periods. Let us further assume that after having loaded one file you for some reason want to check the complete SUM(Q2) over all (!) i-values for each “node” before you (or a PHP program) import the data of the next file. Then you must organize your data such that all records related to one and the same node are placed inside exactly one file. Otherwise you would get a moving target for your SUM when loading your files. In general, the distribution of the records over several files must comply with logical conditions if you want to work with the data already during the loading process.

Another aspect of distributing records over several files may be memory considerations: In one of the articles named above we saw that the loading and index generation consume more and more memory for buffering as the absolute number of records grows. Therefore, if you want to avoid conflicts with limited RAM, it may be reasonable to load bunches of records (in the form of several csv-files) instead of processing one gigantic file with “LOAD DATA INFILE”.
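The node condition above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `split_by_node`, the tuple layout (nr, m, n, i, Q1, Q2, Q3) and the soft limit `max_records` are hypothetical and not taken from the import programs discussed in this blog.

```python
from itertools import groupby


def split_by_node(rows, max_records):
    """Distribute records over chunks so that each node (m, n) ends up in one chunk.

    rows: iterable of tuples (nr, m, n, i, Q1, Q2, Q3)   (assumed layout)
    max_records: soft limit per chunk; a node that is larger than the limit
                 is still kept together in a single chunk.
    """
    def node_key(row):
        return (row[1], row[2])

    chunks, current = [], []
    # Sorting by node makes all records of one node adjacent for groupby().
    for _, group in groupby(sorted(rows, key=node_key), key=node_key):
        node_rows = list(group)
        # Start a new chunk if this node would push the current one over the limit.
        if current and len(current) + len(node_rows) > max_records:
            chunks.append(current)
            current = []
        current.extend(node_rows)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be written to its own csv-file. Because a node never crosses a chunk boundary, a SUM(Q2) per node is already final after the file containing that node has been loaded.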

In most cases the central objective of using data packages is a better control of the file transfer and loading processes: With multiple files you can build up a “file pipeline” on the server. The files in this “pipeline” are handled by a repeatedly running import job. The required time interval to transfer each file to the server and load its data into the base can be limited to a reasonable value by choosing an appropriate number of records per file. After each file is processed on the server you may analyze the new table status, the pipeline status and send success or failure messages to the client (e.g. a browser). With a web client the communication exchange can be managed by Ajax.

A sequential loading of files also allows you to fill the file pipeline on the server in parallel to loading the data of already completely received files into the database. Despite a more complicated communication structure between a browser client and the server you can still keep control over both the transfers from the client to the server and the sequence of server processes loading data into the database, file by file.

I have implemented the sequential handling of several csv-files on a web server with Ajax and PHP in the following way:
A web client starts an Ajax controlled PHP job that analyzes the number of files in a pipeline (a special directory) on the server and chooses the next file to import by some given rules. After having loaded the data the job returns control to the client and sends information about the processed file, the number of loaded records and the new status of the database table. Depending on parameter settings on the client, and of course depending on the quality of the status information from the server, the client may automatically start a new server import job for the next file via Ajax. This continues until the pipeline is empty. The last job may create the required indices on the filled database table, depending on user settings.
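The server side of such a job can be outlined as follows. The sketch is written in Python rather than PHP and rests on assumptions: the selection rule (lexicographic order of file names), the function names `next_file` and `run_import_job`, and the `load` callback that stands in for the actual “LOAD DATA INFILE” step are all hypothetical.

```python
from pathlib import Path


def next_file(pipeline_dir):
    """Return the next csv-file to import, or None if the pipeline is empty.

    Assumed rule: lexicographic order of the file names.
    """
    candidates = sorted(Path(pipeline_dir).glob("*.csv"))
    return candidates[0] if candidates else None


def run_import_job(pipeline_dir, load):
    """Handle exactly one file and return status information for the client."""
    path = next_file(pipeline_dir)
    if path is None:
        return {"file": None, "records": 0, "pipeline_empty": True}
    records = load(path)  # stand-in for the LOAD DATA INFILE step
    path.unlink()         # the file leaves the pipeline once it has been loaded
    return {
        "file": path.name,
        "records": records,
        "pipeline_empty": next_file(pipeline_dir) is None,
    }
```

The client would call such a job repeatedly and stop, or trigger the index creation, as soon as `pipeline_empty` is reported as true.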

Is the total import time independent of a record distribution over multiple files?

In the article MySQL/PHP: LOAD DATA – import of large csv files – linearity with record number? I discussed the seemingly linear dependency of the “LOAD DATA INFILE” import time on the total record number of ONE file. Now we should also investigate possible performance variations for import processes

that load a given number of records sequentially from several csv-files,
that run with or without a parallel building of (unique) indices on defined columns,
that perform the creation of required (unique) indices only AFTER all records have been loaded into the database table.

Importing 5 million records from one or two files

Let us again take the data structure

nr, m, n, i, Q1, Q2, Q3

All Q-values are double precision floating point numbers (MySQL type DOUBLE) in our scenario. We assume that at the beginning of our import runs we have already defined some indices on our database table:

a unique index (with auto-incrementation) on the column “nr”
a unique index defined for the columns “[n,m,i]”

In the course of our tests we shall study the effect of

dropping these indices,
creating them whilst filling the table with data,
creating them after having filled the table with data.

For each test case with 5 million records we look at two different situations:

Importing one csv-file with all 5 million records.
Sequentially importing 2 csv-files – one with 3 million records and one with 2 million records.

Our test server is a relatively small one: a KVM virtualized LAMP server with 2 GByte of assigned RAM, an Intel Q9550 CPU and a RAID 10 disk array. For the tests I worked with MySQL MyISAM tables. In my concrete test table I chose:

distinct m values : 1 (as an extreme situation)
distinct n values : 13580
distinct i values : 386

I got the following numbers:

import by 1 file	import by 2 files	with unique ai-index on nr	with non-unique index on nr	without any index on nr	with unique index on [n,m,i]	with non-unique index on [n,m,i]	without any index on [n,m,i]	response time
yes	no	yes	no	no	yes	no	no	30 secs
no	yes	yes	no	no	yes	no	no	30 secs
yes	no	no	yes	no	yes	no	no	30 secs
no	yes	no	yes	no	yes	no	no	30 secs
yes	no	no	yes	no	no	yes	no	21.3 secs
no	yes	no	yes	no	no	yes	no	24.8 secs
yes	no	no	no	yes	yes	no	no	19.2 secs
no	yes	no	no	yes	yes	no	no	19.4 secs
yes	no	no	no	yes	no	yes	no	9.8 secs
no	yes	no	no	yes	no	yes	no	13.3 secs
yes	no	no	no	yes	no	no	yes	8.7 secs
no	yes	no	no	yes	no	no	yes	8.8 secs

The accuracy of the response times is in the range of a second, depending on the general situation on the virtualization host and its disk system at the time of measurement. The response times do not include any file transfer times between the client and the server in our LAN. However, the response time for a sequential load of several files always includes some common Ajax overhead plus the time to load the PHP program files into our Apache server again. As the system buffers and caches sufficiently, the interaction between the client and the server should only have a marginal impact.

What do we learn from the results? I dare to derive 4 rules:

Rule 1: Any unique index built during “LOAD DATA INFILE” reduces performance (due to the uniqueness checks which have to be performed). So, you should carefully check whether a unique index is really required later on when you work with the data.
Rule 2: The total import times for several files and for one big file are comparable if a unique index is created whilst or after data loading.
Rule 3: Building a non-unique index during the import increases the total import time for a sequential load of several csv-files in comparison to a loading process for just one big csv-file.
Rule 4: Without any index defined, the total loading time for several files is comparable to the loading time for one big csv-file.

The most surprising effect for me was the one described in rule 3. However, given a measurement accuracy in the range of a second, we may have some doubts whether we see a real effect. So, we should have a look at a situation with more data records and more files to confirm this rule.

Now, it will of course be interesting to see a comparison between different loading processes with a creation of the [n,m,i]-index AFTER the processing of all files with “LOAD DATA INFILE”. Technically, in such cases we drop any index before we load any file data and create the index only after the data of all files have been loaded into the database. Note that index dropping may take some time, too.
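The order of steps just described (drop the index, load all files, create the index) can be written down as a sequence of SQL statements. The following Python sketch only builds the statement strings; the function name, the table and index names in the usage example and the field separator ';' are assumptions for illustration. The identifiers are inserted without any escaping, so the sketch is only suitable for trusted input.

```python
def import_statements(table, csv_files, index_name, index_columns, unique=False):
    """Build the SQL sequence: drop the index, load every file, create the index."""
    kind = "UNIQUE INDEX" if unique else "INDEX"
    statements = [f"ALTER TABLE {table} DROP INDEX {index_name}"]
    for path in csv_files:
        statements.append(
            f"LOAD DATA INFILE '{path}' INTO TABLE {table} "
            # assumed csv layout: ';' between fields, one record per line
            "FIELDS TERMINATED BY ';' LINES TERMINATED BY '\\n'"
        )
    statements.append(
        f"ALTER TABLE {table} ADD {kind} {index_name} ({', '.join(index_columns)})"
    )
    return statements


# Hypothetical names: table "bigdata", index "idx_nmi", two files in /tmp
for sql in import_statements(
    "bigdata", ["/tmp/part_01.csv", "/tmp/part_02.csv"], "idx_nmi", ["n", "m", "i"]
):
    print(sql)
```

In a pipeline with several import jobs the DROP statement would belong to the first job and the ADD INDEX statement to the last one.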

Table: Comparison of server response times for index creation AFTER and WHILST loading 5 million records with “LOAD DATA INFILE”

import by 1 file	import by 2 files	without any index on nr	with unique index building on [n,m,i] whilst loading	with non-unique index building on [n,m,i] whilst loading	with unique index creation on [n,m,i] after loading	with non-unique index creation on [n,m,i] after loading	response time
yes	no	yes	yes	no	no	no	19.2 secs
no	yes	yes	yes	no	no	no	19.4 secs
yes	no	yes	no	yes	no	no	9.3 secs
no	yes	yes	no	yes	no	no	13.2 secs
yes	no	yes	no	no	yes	no	19.5 / 18.5 secs
no	yes	yes	no	no	yes	no	20.7 / 19.1 secs
yes	no	yes	no	no	no	yes	11.9 / 10.8 secs
no	yes	yes	no	no	no	yes	11.4 / 11.1 secs

The first time value given in the last four lines includes an index drop time which varies a bit according to precise server conditions at the time of measurement.

The most interesting result is that the differences measured between a sequential load of several files and a load of just one big file almost disappear when we create the index after the pure data loading.

Importing 10 million records from one or five files

Now, let us turn to 10 million records and an equal distribution of records over 5 files (with 2 million records each) and see whether we confirm our results. For the test we have just doubled the number of nodes in comparison to the 5 million case.

Table: Server response times for loading 10 million records with “LOAD DATA INFILE”

import by 1 file	import by 5 files	without any index on nr	with parallel unique index building on [n,m,i]	with parallel non-unique index building on [n,m,i]	with unique index creation on [n,m,i] after loading	with non-unique index creation on [n,m,i] after loading	response time
yes	no	yes	yes	no	no	no	42 secs
no	yes	yes	yes	no	no	no	42.4 secs
yes	no	yes	no	yes	no	no	20 secs
no	yes	yes	no	yes	no	no	37 secs
yes	no	yes	no	no	yes	no	43.3 / 42.0 secs
no	yes	yes	no	no	yes	no	41.3 / 40.0 secs
yes	no	yes	no	no	no	yes	22.9 / 21.8 secs
no	yes	yes	no	no	no	yes	22.5 / 21.4 secs

Again, we see the huge impact of the creation of a unique index. However, the total loading time is almost independent of whether you create the index whilst or after the data loading. This is understandable as the required uniqueness checks must be done in both situations.

However, at least for me, it was almost shocking to see the difference between rows 3 and 4 of the table:

There is a big difference between a sequential loading process for several files and a loading process for just one file, when a non-unique index was created whilst loading with “LOAD DATA INFILE”.

Note in contrast the considerable reduction of the required time if you create the non-unique index after all data have been loaded into the database. The measured discrepancy in total loading time is really remarkable, and I actually have no convincing explanation for these findings. The tree structure of the index is quite simple; I do not really see what causes the overhead. I have repeated the test several times to exclude any extraordinary conditions on the server. I have also validated the parameter settings and the sequence of loading steps to exclude any unplanned work or actions on the server.

The [n,m,i] index in my case has a size of 215 MByte; the data themselves take another 400 MByte. All in all we speak about roughly 600 MByte. So, I suspect some reading and writing from/to the disk at the end/beginning of the 5 individual loader jobs and some inefficient memory management. I have not yet investigated this in detail. Write me a mail if you know the reason.
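To put a number on the discrepancy, the response times for the non-unique [n,m,i] index from the two tables above can be turned into ratios between the multi-file and the single-file case. The small Python calculation below only reuses the measured values; for index creation after loading it takes the second time value, i.e. the one without the index drop time. The function name `multi_file_penalty` is just chosen for this illustration.

```python
# Response times in seconds for the non-unique [n,m,i] index (no index on nr),
# copied from the tables above.
# key: (records, files) -> (index built whilst loading, index created after loading)
timings = {
    (5_000_000, 1): (9.3, 10.8),
    (5_000_000, 2): (13.2, 11.1),
    (10_000_000, 1): (20.0, 21.8),
    (10_000_000, 5): (37.0, 21.4),
}


def multi_file_penalty(records, files, mode):
    """Ratio of multi-file to single-file response time (mode 0: whilst, 1: after)."""
    return timings[(records, files)][mode] / timings[(records, 1)][mode]


for records, files in [(5_000_000, 2), (10_000_000, 5)]:
    whilst = multi_file_penalty(records, files, 0)
    after = multi_file_penalty(records, files, 1)
    print(f"{records} records, {files} files: whilst x{whilst:.2f}, after x{after:.2f}")
```

With these numbers, building the index whilst loading costs a factor of roughly 1.4 for 2 files and 1.85 for 5 files, whereas the ratio stays close to 1 when the index is created after loading.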

Independent of the reasons: For me, it was really important to see this difference because I do not need a unique index and I want to work with many more files to cover much larger numbers of records.

So, our result is:

You should avoid building a non-unique index whilst “LOAD DATA INFILE” is applied to multiple files by sequentially started (PHP) jobs on the server. Whenever you load your data from several files, create the index after all data from all files have been imported into the database table.

Impact on consistency checks during a sequential load of several files

In the article MySQL/Aggregation: Comparing COUNT(DISTINCT) values on big tables I discussed the creation of a helper table and respective indices to get a good performance of consistency checks. Most of these consistency checks were based on an analysis of derived result sets for “SELECT COUNT(DISTINCT … ).. ” statements. My basic idea was to perform such checks after each import process when dealing with a sequence of files.

However, in the named article I also showed that a fast creation of a helper table already required some existing index on the big data table. But our present results favor a solution where an index is created only after all data of all files have been loaded. So, no intermediate consistency checks in between the loading processes for two csv-files!

After some thinking I came to the conclusion that this is no big disadvantage. Even if you perform consistency checks only after all data have been loaded, you will still be able to identify the records which are problematic. And with some additional data and file analysis, which will cost some extra time but may be worth it, you may even find the file from which a record was loaded. So, you can give the user who tried to load the files a full report on the suspicious records that caused inconsistencies and on the files that contained them.
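Finding the source file of a suspicious record can be as simple as scanning the csv-files for its key. The following Python sketch assumes that the loaded files are still available on the server, that fields are separated by ';' and that the columns appear in the order nr, m, n, i, Q1, Q2, Q3. The function name `find_source_files` is hypothetical.

```python
import csv


def find_source_files(csv_paths, key, delimiter=";"):
    """Return all files that contain a record with the given key (m, n, i)."""
    wanted = tuple(str(value) for value in key)
    hits = []
    for path in csv_paths:
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter=delimiter):
                # assumed column order: nr, m, n, i, Q1, Q2, Q3
                if tuple(row[1:4]) == wanted:
                    hits.append(path)
                    break  # one hit per file is enough for the report
    return hits
```

A key that shows up in more than one file is a natural candidate for a uniqueness problem, and the list of files can go directly into the report for the user.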

Conclusion

We have tested different combinations for loading data with one or multiple files and with or without creating indices during/after the data loading. Our tests have shown that if you import data sequentially from several files and you want to create a non-unique index over some columns, it makes a huge difference whether you create this non-unique index during or after the sequential “LOAD DATA INFILE” processes. One should avoid creating such indices whilst loading the data with “LOAD DATA INFILE”. They should be created after all data from all files have been imported into the database table.

In contrast, such an order of steps does not seem to be required when creating unique indices. However, a unique index in general takes significantly more time to create than a non-unique one. You should therefore evaluate the necessity of unique indices.
