In this article I want to list and briefly discuss some resource-limiting parameters for PHP that may become important when big data files are transferred to a server and afterwards imported into a database. Before doing that, I first want to discuss a related comment from a reader.
Data splitting to avoid huge transfer times to the server
In one of my last articles [Importing large csv files with PHP into a MySQL MyISAM table] I recommended using the SQL statement "LOAD DATA INFILE ..." inside PHP programs to import big csv files into a MySQL database. In addition, I recommended delivering all fields of a record with the same key dependency in one file instead of distributing the data fields of records over several separate csv files and importing these files into possibly several separate (denormalized) database tables.
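As a reminder of what such an import looks like, here is a minimal sketch that merely builds the "LOAD DATA INFILE" statement; the file path, table name and csv delimiters are made-up examples, and in a real script you would execute the statement via mysqli or PDO:

```php
<?php
// Hedged sketch: build a LOAD DATA INFILE statement for importing a csv
// file into a MySQL table. Path, table name and delimiters are examples.
function buildLoadDataSql(string $csvPath, string $table): string
{
    return "LOAD DATA INFILE '" . addslashes($csvPath) . "'"
         . " INTO TABLE `$table`"
         . " FIELDS TERMINATED BY ';' ENCLOSED BY '\"'"
         . " LINES TERMINATED BY '\\n'"
         . " IGNORE 1 LINES";   // skip a header line, if present
}

echo buildLoadDataSql('/tmp/records.csv', 'fact_table'), "\n";
```

In practice you would pass this string to something like `$pdo->exec(...)`; the point here is only the shape of the statement.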
A reader reacted by mail, writing that a file combining many data fields would of course become much larger than each of the individual files with only a few data fields. He pointed out that this may become a problem for the transfer of the file from a client to the server - especially if the server is hosted at a provider. Typical and connected problem areas could be:
- bandwidth of the client-server connection - especially for ADSL connections
- limitations on the server side regarding e.g. file size, maximum job duration or input times
I agree - but only partially.
First, it is true that a limited bandwidth for file uploads may become a problem. An example with a "small" csv file containing a million records of 7 fields each (4 key fields, 3 quantity fields) makes that clear:
Upload time = transfer time to the server :
- In my case the file size was around 35 MByte. Uploading such a file over ADSL with a maximum upload rate of 1 MBit/s gives you an effective upload time (i.e. transfer time to the server) of about 5 minutes.
- The transfer time has to be compared with the data import time on the server itself, which turns out to be in the region of 6 seconds when using the "LOAD DATA INFILE" directive.
The time difference gets even bigger for larger files. So we see that the transfer time may soon become the dominant factor when our server is located somewhere on the internet. The question arises whether the required upload times may collide with server settings. This would be one example of a server-side resource limitation we need to deal with when working with big data. We come back to a potentially relevant, but disputed, parameter later on.
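The 5-minute figure above can be reproduced with a small back-of-the-envelope calculation; the file size and upload rate are the example values from the text:

```php
<?php
// Rough upload-time estimate for the example in the text:
// a 35 MByte csv file over an ADSL line with a 1 MBit/s upload rate.
$fileSizeBytes = 35 * 1024 * 1024;   // 35 MByte
$uploadBitRate = 1000000;            // 1 MBit/s (line rates are quoted in decimal)

$uploadSeconds = ($fileSizeBytes * 8) / $uploadBitRate;
printf("Upload time: %.0f s (about %.1f minutes)\n",
       $uploadSeconds, $uploadSeconds / 60);
// Compare this with roughly 6 s for the LOAD DATA INFILE import itself.
```

The result is around 294 seconds - close to the 5 minutes quoted above, and roughly a factor of 50 above the import time on the server.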
Anyway, I do not agree with splitting data by fields to overcome bandwidth problems:
One reason is that the total upload time will not become smaller. The only advantage is a smaller upload time interval per file, which may help you to keep better control over the upload process(es). My real argument against splitting by fields is that the total import time for several files - each with a few data fields, but with the same huge number of records - may become considerably bigger than for one file with all data fields (and the same number of records). At least if you use the fast "LOAD DATA INFILE" feature of the MySQL engine. So, I would
- either tolerate a relatively big transfer time to the server
- or try to organize the data records in such a way that they can be uploaded to the server and imported into the database sequentially - i.e. by files with the full data field spectrum but with reduced record numbers.
I would not give up the idea of transferring csv files with as many fields as possible per record line - as long as this is compatible with a normalized data model. The second option can very often be realized without problems - and with Ajax and HTML5 technologies it can even be implemented so that the transfers run automatically one after the other (file pipelines). So, if you want to limit the transfer times, split the data by records and not by fields, i.e. transfer several files with bunches of records instead of several files with fewer fields each.
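Splitting by records can be sketched with a few lines of PHP; the source path and the chunk naming scheme are made-up examples, and error handling is omitted for brevity:

```php
<?php
// Hedged sketch: split a big csv file into several smaller files by
// *records* (full field spectrum per line), not by fields. The chunk
// files can then be uploaded and imported sequentially.
function splitCsvByRecords(string $srcPath, int $recordsPerChunk): array
{
    $src = fopen($srcPath, 'rb');
    $chunkFiles = [];
    $out = null;
    $n = 0;
    while (($line = fgets($src)) !== false) {
        if ($n % $recordsPerChunk === 0) {
            if ($out) {
                fclose($out);
            }
            $chunkPath = $srcPath . '.part' . count($chunkFiles);
            $chunkFiles[] = $chunkPath;
            $out = fopen($chunkPath, 'wb');
        }
        fwrite($out, $line);
        $n++;
    }
    if ($out) {
        fclose($out);
    }
    fclose($src);
    return $chunkFiles;   // upload and import these files one after the other
}
```

Each chunk file keeps all fields of its records, so "LOAD DATA INFILE" can import every chunk into the same normalized table.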
Resource limitations require control over server parameter settings
Nevertheless, relatively big transfer times for big files may conflict with some server settings. Also - and more importantly - the time required for the import into the database, or the file size and memory consumption, can conflict with limits set on your server. E.g., the maximum time a PHP job is allowed to run is limited on most web servers - as are many other resources given to a PHP job.
Many of the PHP resource limitations we may be confronted with on an Apache web server are defined by parameter settings in a php.ini file. On a SuSE system this is typically located at "/etc/php5/apache2/php.ini". Many providers deny access to these global settings - although one can in principle influence them for certain scripts via .htaccess files or by putting php.ini files into the directories where your PHP scripts reside.
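A per-directory override could look like the following hypothetical .htaccess fragment - note that this only works if PHP runs as an Apache module and the provider's AllowOverride setting permits it; with PHP-FPM/FastCGI setups you would need a directory-specific php.ini or pool configuration instead, and the values are just examples:

```apacheconf
# Hypothetical .htaccess fragment (mod_php only, AllowOverride permitting)
php_value upload_max_filesize 64M
php_value post_max_size       72M
php_value max_execution_time  300
```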
This is one of the reasons why I urge my customers to rent a root server, or at least a virtualized server, from some provider when seriously dealing with big data. You need to have control over a variety of server settings - sometimes also general settings and not only php.ini parameters. Working with big data files and applications dealing with tens or hundreds of millions of records in joined tables requires full server control, especially during the development phase. This is, in my opinion, nothing for simple web site hosting.
Some relevant php.ini parameters
What are typical php.ini parameters that limit resources for PHP processes on a server and may get relevant for file uploads? I think the most important ones are the following:
- max_execution_time :: default: 30 (many providers limit this to 10)
- upload_max_filesize :: default: 2M
- post_max_size :: default: 8M
- memory_limit :: default: 128M
- max_input_time :: default: 60
- session.gc_maxlifetime :: default: 1440
The time values are given in seconds, the memory values in MBytes.
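Since providers often override these documented defaults, it is worth checking which values are actually in effect on a given server. A quick way to do this from a small PHP script:

```php
<?php
// Print the values actually in effect on this server - providers often
// override the documented php.ini defaults.
$params = [
    'max_execution_time',
    'upload_max_filesize',
    'post_max_size',
    'memory_limit',
    'max_input_time',
    'session.gc_maxlifetime',
];
foreach ($params as $p) {
    printf("%-24s => %s\n", $p, ini_get($p));
}
```

Alternatively, a phpinfo() page shows the same values together with the "Local Value"/"Master Value" distinction, which tells you whether a per-directory override is active.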
Most of these parameters are mentioned in the following web articles, which comment on their impact on file uploads. Please have a look at these web pages.
Important parameters for the transfer of big data files are "upload_max_filesize" and "post_max_size". I want to stress the following point:
Both of these parameters have to be set consistently when dealing with file uploads. (Actually, in the past I sometimes forgot that myself and wasted time wondering why a file did not get loaded although I had set a sufficient value for upload_max_filesize.)
When you upload a file plus some additional POST data the total amount of POST data can be bigger than just the file size. However, "if the size of post data is greater than post_max_size, the $_POST and $_FILES superglobals are empty" - according to the PHP manual. You may even get no warnings or errors in some situations. Therefore, the value of "post_max_size" should always be bigger than the value of "upload_max_filesize" - and the latter should of course be at least as big or a bit bigger than the size of the file you plan to transfer to a server.
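The consistency of the two limits can be checked programmatically. The following sketch converts the php.ini shorthand notation ("2M", "8M", ...) into bytes; the final detection snippet - checking $_SERVER['CONTENT_LENGTH'] when both superglobals arrive empty - is a commonly used workaround for the silent-failure case described above, not an official API:

```php
<?php
// Convert php.ini shorthand sizes ("2M", "8M", "1G") into bytes, so the
// two limits can be compared programmatically.
function iniSizeToBytes(string $val): int
{
    $val  = trim($val);
    $unit = strtoupper(substr($val, -1));
    $num  = (int)$val;   // leading integer part of the shorthand value
    switch ($unit) {
        case 'G': return $num * 1024 ** 3;
        case 'M': return $num * 1024 ** 2;
        case 'K': return $num * 1024;
        default:  return (int)$val;   // plain number of bytes
    }
}

// Sanity check: post_max_size should exceed upload_max_filesize.
$uploadMax = iniSizeToBytes(ini_get('upload_max_filesize'));
$postMax   = iniSizeToBytes(ini_get('post_max_size'));
if ($postMax <= $uploadMax) {
    echo "Warning: post_max_size should be larger than upload_max_filesize\n";
}

// Detecting the "silently empty" case in an upload script: the request
// carried more POST data than post_max_size allows.
if (($_SERVER['REQUEST_METHOD'] ?? '') === 'POST'
    && empty($_POST) && empty($_FILES)
    && (int)($_SERVER['CONTENT_LENGTH'] ?? 0) > $postMax) {
    echo "Upload rejected: POST body exceeds post_max_size\n";
}
```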
Another important parameter is "max_execution_time". It limits the time a PHP process is allowed to run. It should be big enough to cover the file handling and database import times on the server. Transfer times are not included, if we believe the hints given at http://php.net/manual/de/features.file-upload.common-pitfalls.php.
In my understanding the "max_input_time" parameter limits the time for "parsing" request data (= POST or GET data). See http://php.net/manual/en/info.configuration.php#ini.max-input-time
However, a comment on the following page says something different:
"Warning - max_input_time sets the maximum time, in seconds, the script is allowed to receive input; this includes file uploads. For large or multiple files, or users on slower connections, the default of 60 seconds may be exceeded."
Personally, I have some doubts about this last statement, as I never experienced any problems with the standard settings and files above 200 MB. So a contrary position would be:
"Parsing" refers to the time interval between the arrival of the request on the web server and the start of the PHP script's execution - so "max_input_time" would not include the upload time (= transfer time to the server). However, it would include the time to prepare superglobal arrays such as $_GET, $_POST, $_FILES.
This interpretation makes some sense to me. However, I have not tested it thoroughly.
Interestingly enough, there is some dispute about the meaning of the max_input_time parameter and its relation to the max_execution_time parameter on the internet. See:
I have no definite opinion about this discussion yet. In two of the articles listed above one can see that even the error messages produced by conflicts with the max_input_time limit can be misleading, as they name the wrong parameter. Therefore, I recommend being aware of the parameter "max_input_time", although the default should be sufficient for most cases. Setting it to "-1" corresponds to an unlimited max_input_time interval - whatever it really has an impact on.
If you really work with huge files of several hundred MBytes, even the session garbage collection may become important: session.gc_maxlifetime sets a lifetime limit for session data, after which the system regards them as garbage. So, if your import or data handling times get big enough, this parameter must be adjusted, too.
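Since session.gc_maxlifetime can be changed at runtime, a long-running import workflow can extend it from within the script. A minimal sketch - the 2-hour value is just an example, and the call must come before session_start():

```php
<?php
// Hedged sketch: extend the session garbage-collection lifetime for a
// long-running import workflow. Must be set before session_start();
// the value (2 hours) is an arbitrary example.
ini_set('session.gc_maxlifetime', (string)(2 * 3600));
session_start();
```

Note that on shared hosts a system-wide garbage collection run (e.g. the Debian/Ubuntu cron job) may still use the global setting, so a per-script override is not always sufficient.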
The "memory_limit" of a PHP script may also be reached when processing huge files.
So you see, there is a whole bunch of parameters which you may need to adjust when you start working with larger and larger files.
Web hosting providers and php.ini-settings
You find typical settings of php.ini-parameters for the web hosting packages of the providers Strato and 1&1 here.
Most web providers also give more or less clear information about which of the PHP-relevant parameters can be changed by customers (e.g. by the use of .htaccess files or directory-specific php.ini files) and which cannot. See e.g.:
At some providers you can use a .htaccess file; at others you may need to put a php.ini file into each script directory. See e.g.: