PHP/OO/Schemata: Composite SLAVE objects and (X)HTML generator methods

Some time ago I wrote an article about how to treat groups of associated, related properties of a Real World Object [“RWO”] when representing it by an object [“PWO”] in PHP.

We assumed that the property description of a certain object class is done with the help of a so called “SCHEMA” (a file with a list of definitions and rules or Schema program using definitions saved in a database Schema table). A SCHEMA is specific for a PWO class. A SCHEMA defines object relations, object properties, the relations between certain object properties, the relation of object properties with database fields and of course the type of each property, the property’s own properties as well as associated constraints. Such “SCHEMATA” would play a central role in a web application as they encapsulate vital structural and detail information about objects and their properties.

Every object constructors would refer to an appropriate SCHEMA, as well as web generators for the creation e.g. of form or web template elements would do. Actually, any reasonable (web) application would of course work with multiple SCHEMATA – each for a different object class. SCHEMATA give us a flexible mode to change or adapt a classes properties and relations. They also close the gap between the OO definitions and a relational database.

To handle groups of associated properties in such an environment I suggested the following:

  • Split the SCHEMA-information for properties and related database fields into several “SLAVE-Schemata” which would be built as sub objects within the MAIN SCHEMA object. Each SCHEMA would describe a certain group of closely associated object properties.
  • Create and use “SLAVE PWO SGL objects” as encapsulated sub-objects in a MAIN SGL PWO. Each SLAVE PWO object gets it’s own properties defined in a related SLAVE-SCHEMA.
  • Each of the SLAVE PWO objects receives it’s knowledge about it’s individual SLAVE-SCHEMA by an injection process during construction.

See:
PHP/OO/Schemata: Decouple property groups by combined composite patterns

Each PWO object, representing a single RWO we shall call a SGL PWO. It comprises a series of sub-objects: SLAVE SGL PWOs that refer to corresponding SLAVE Schemata of a MAIN Schema. See the illustration from the article named above:

Slave_schemata

In the named article I had discussed that one can iterate methods for complete database interactions over the MAIN and the SLAVE objects. The same is true for interactions with POST or SESSION data in a CMS like web application. So, there is no need to rewrite core method code originally written for objects which comprise all object properties in just one property/field-SCHEMA without any SLAVE objects. At least with respect to field checks and database interactions.

This works if and when a SGL object is derived from a base class which comprises all required methods to deal with the database, POST and SESSION data. Each MAIN or SLAVE SGL PWO knows itself how to store/fetch it’s property data into/from database tables (or $_POST or $_SESSION arrays) and uses the same methods inherited from common SGL base classes (= parent classes) to do so.

I have meanwhile introduced and realized this SLAVE PWO and SLAVE SCHEMA approach in a quite general form in my web application framework. In this article I briefly want to discuss some unexpected consequences for HTML
generator methods in a web or CMS context.

(X)HTML-Generator methods – where to place them ?

When you design CMS like web applications you always have to deal with Template [TPL] structures and objects that fill the templates with reasonable formatted contents. The contents may be just text or images or links inserted into template placeholders. In more complicated cases (like e.g. the generation of maintenance masks with forms), however, you may have to generate complete HTML fragments which depend on property/field information given in your respective SCHEMA.

A basic design question is: Where do we place the generator methods? Should the SGL PWOs know how to generate their property contents or should rather a “Template Control Object” – a “TCO” – know what to do? I have always preferred the second approach. Reason:

TPL aspects may become very specific, so the methods may need to know something about the TPL structures – and this is nothing that I want to incorporate into the OO representation [PWO] of real world objects.

Over time I have developed a bunch of generator methods for all kind of funny things required in customer projects. The methods are defined in base classes for Template Control Objects and or special purpose sub classes injected into TCOs. A TCO knows about its type of TPL and works accordingly. (By the way: With Pear ITX or Smarty you can realize a complete separation of (X)HTML-code and the PHP code).

(X)HTML-Generator methods – which MAIN or SLAVE PWO and which MAIN/SLAVE SCHEMA are we dealing with ?

In addition to some property/field identifiers a (X)HTML generator method has of course to know what SGL PWO and what SCHEMA it has to work with. This information can be fetched either by the TCO due to some rules or can be directly injected into the methods.

In the past I wanted to keep interfaces lean. In many applications the SGL PWO object was a classical singleton. So, it could relatively easily be received or identified by the central TCO. I did not see any reason to clutter TCO method interfaces with object references that were already known to their TCO object. So, my generator methods referred and used the SGL PWO object and it’s SCHEMA by invoking it with the “$this”-operator:

function someTCOgenerator_method() {
…..
do something with $this->SGL and $this->Schema
….
}

However, in the light of a more general approach this appears to be a too simplistic idea.

If we regard a SLAVE SGL object as a relatively compact entity inside a PWO – as a SLAVE object with its own property and field information SCHEMA – than we see: for a generator method it behaves almost like an independent object different from the MAIN PWO. This situation is comparable to one where the generator method really would be requested to operate on instances a completely different PWO class:

A HTML generator method needs to know the qualities of certain OO properties and associated database field definitions. In our approach with MAIN SGL PWOs comprising composite SLAVE SGL PWO objects each SGL object knows exactly about its associated MAIN or SLAVE Schema object. To work correctly the generator method must get access to this specific SCHEMA. This would in a reasonable application design also be valid for PWO objects representing other, i.e. different types of RWOs.

A (X)HTML generator method can work properly as soon as it knows about

  • the SGL object,
  • the object property (identified by some name, index or other unique information) to operate on and generate HTML code for,
  • the SCHEMA describing the qualities of the object property and related database fields.

This would in
our approach also be given for our SLAVE SGL objects or any PWO as soon as we inject it into our generator method.

Therefore, (X)HTML generator methods of TCOs should be programmed according to the following rules:

  • Do not assume that there is only one defined class of PWO SGL objects that the Template Control Object TCO needs to know about and needs to apply it’s generator methods to.
  • Instead enable the (X)HTML generator methods of a TCO to be able to work with any kind of PWO and its properties – as long as the PWO provides the appropriate SCHEMA information.
  • Inject the SGL [SLAVE] PWO and thereby also its associated [SLAVE] SCHEMA into each TCO generator method:
    function someTCOgenerator_method($SGL_ext, …..).
  • Do not refer to the TCO’s knowledge about a SGL PWO by using the “$this” operator (like “$this->Sgl”) inside a generator method of a TCO; refer all generator action to a local $SGL object reference that points to the injected (!) object $SGL_ext:
    $SGL = $SGL_ext.
    Also refer to a local $Schema which points to the $SGL->Schema of the injected $SGL object:
    $Schema = $Sgl->Schema.
    (Remember: Objects are transferred by reference !)

These recipes give us the aspired flexibility to deal with properties both of SLAVE objects and objects of a different PWO class.

The injection is a small but structurally important difference in comparison to the database interaction methods directly incorporated in SGL objects or better their base classes. Here the TCO (!) method must receive the required information – and we gain flexibility when we inject the SGL object into it (plus information identifying the property for which a HTML fragment has to be generated for).

Had I respected the rules above already some time ago it would have saved me much time now. Now, I had to adapt the interfaces of many of my generator methods in my TCO base classes.

Again, I found one experience confirmed:

  • In case of doubt do not hesitate to use loose object coupling via injection on general methods which could in principle be applied to more and other objects than the ones the object containing the method knows about at the time of your class design.
  • Use injection even if it may look strange when you need to do something like
    $this->someTCOgenerator_method($this->KnownObject);
    i.e., when you inject something the present object containing the method knows about already.

It will save time afterwards when iterator patterns over other objects have to be used and when you may access the (public) method from outside, too.

Iteration over SLAVE objects

Now, if our (X)HTML-generator methods are prepared for injection of SGL PWO objects, we have no more difficulties to generate HTML fragments required in templates for selected properties of MAIN and SLAVE PWO SGL objects:

We just have to iterate the usage of the generator method for the properties/fields of the MAIN SGL PWO as well as its SLAVE SGL PWOs. By getting the SGL object injected the method also knows about the right SCHEMA to use and provide required detail information for the generation of related HTML code (e.g. for a form element).

Think about a situation in which we want to provide a form with input fields which a user can use to update data for all properties of a certain PWO. We just have to apply generator methods to create the required input field types according to the appropriate SCHEMA informations. For this

  • we loop (iterate) over the MAIN and all it’s SLAVE objects,
  • identify all properties according to the MAIN or SLAVE SCHEMA
    information
  • and apply the generator method by injecting the relevant property identifier plus the MAIN/SLAVE-SGL object in question.

Mission accomplished.

The principle of iteration over SLAVE objects is actually nothing new to us: We used it already when dealing with the database interaction methods. (It is a basic ingredient of a composite pattern).

If we only want to work on selected properties then we need to know which of the properties is located in which SLAVE PWO and described in which SLAVE SCHEMA. To be able to do so, we should create and use an array

  • that collects information about which PWO property belongs to which of the PWOs SLAVE objects
  • and which is filled in course of the construction process of a PWO.

Conclusion

When realizing a composite pattern in the form of SLAVE objects (with SLAVE Schemata) to deal with closely associated property groups of complex objects you can apply existing base class methods and iterate over the SLAVE objects to accomplish complete transactions affecting all properties of a structured PWO. This principle can be extended to (X)HTML generator methods of Template Control Objects, if these methods are prepared to receive the SLAVE SGL PWO objects by injection. If we only want to apply generator methods on a bunch of selected properties, we should use an array associating each PWO property with a SLAVE PWO SGL object.

PHP: Upload/import of large csv files, server resource limitations, php.ini

In this article I want to list and briefly discuss some resource limiting parameters for PHP that may become important when big data files are transferred to a server and afterwards imported into a database. Before I do that, I first want to discuss a related comment of reader.

Data splitting to avoid huge transfer times to the server

In one of my last articles [Importing large csv files with PHP into a MySQL MyISAM table] I recommended using the SQL statement “LOAD DATA INFILE …” inside PHP programs to import big csv-files into a MySQL database. In addition I recommended to deliver all fields of a record with the same key dependency in one file instead of distributing data fields of records over several separate csv-files and import these files in maybe several separate (denormalized) database tables.

A reader reacted by writing in a mail that a file combining many data fields would of course become much larger than each of the individual files with only few data fields. He pointed out that this may become a problem for the transfer of the file from a client to the server – especially if the server was hosted at a provider. Typical and connected problem areas could be :

  • bandwidth of the client-server connection – especially for ADSL connections
  • limitations on the server side regarding e.g. file size, maximum job duration or input times

I agree – but only partially.
First, it is true that a limited bandwidth for file uploads may become a problem. An example of a “small” csv file with a million records each containing 7 fields (4 key fields, 3 quantity fields) makes that clear:

Upload time = transfer time to the server :

  • In my case the file size was around 35 MByte. To upload such a file with ADSL and a maximum upload rate of 1MBit gives you an effective 5 minutes upload time (better transfer time to the server).
  • The transfer time has to be compared with the data import time on the server itself which turns out to be in the region of 6 seconds when using the “LOAD DATA INFILE” directive.

The time difference will get even bigger for larger files. So, we see that the transfer time may soon become a dominant factor when our server is located somewhere in the internet. The question arises, whether required upload times may collide with server settings. This would be one example of a server side resource limitation we need to deal with when working with big data. We come back to a potentially relevant, but disputed parameter later on.

Anyway, I do not agree with data splitting by fields to overcome bandwidth problems:
One reason is that the total upload time will not become smaller. The only advantage is a smaller upload time interval per file. This may help to get a better control over the upload process(es). My real argument against splitting by fields is that the total data import time for several files with a few data fields each but with the same huge number of records may become considerably bigger than for one file with all data fields (and the same number of records). At least if you use the fast “LOAD DATA INFILE” feature of the MySQL engine. So, I would

  1. either tolerate a relative big transfer time to the server
  2. or try to organize the data records in such a way that they can be uploaded to the server and imported into the data base sequentially – i.e. by files with the full data field spectrum but with reduced record numbers.

I would not give up the idea of transferring csv-files with as many fields as possible per record line – as long as this is
compatible with a normalized data model. Point 2 can be realized very often without problems – and with Ajax and HTML5 technologies it can even be solved in such a way that the transfers are automatically done one after the other (file pipelines). So, if you want to limit the transfer times, split the data by records and not by fields. I.e. transfer several files with bunches of records instead of several files with fewer fields.

Resource limitations require control over server parameter settings

Nevertheless, relatively big transfer times for big files may conflict with some server settings. Also – and more important – the time required for the import into the database or the file size and memory consumption can conflict with limits set on your server. E.g., the maximum time a PHP job is allowed to run on most web servers is limited – as are many other resources given to a PHP job.

Many of the PHP resource limitations we may be confronted with on a Apache Web server are defined by parameter settings in a php.ini-file. On a SuSE system this is typically located at “/etc/php5/apache2/php.ini”. Many providers deny access to these global settings – although one could in principle influence them for certain scripts by .htaccess-files or by putting php.ini files into the directories where your PHP-scripts reside.

This is one of the reasons why I urge my customers to rent a root server or at least a virtualized server of some provider when seriously dealing with big data. You need to have control over a variety of server settings. Sometimes also general settings and not only php.ini-parameters. Working with big data files and applications dealing with tens or hundreds of millions of records in joined tables require full server control especially during the development phase. This is in my opinion nothing for simple web site hosting.

Some relevant php.ini parameters

What are typical php.ini parameters that limit resources for PHP processes on a server and may get relevant for file uploads? I think the most important ones are the following:

  • max_execution_time :: default value : 30 (many providers limit this to 10)
  • upload_max_filesize :: default: 2 M
  • post_max_size :: default : 8M
  • memory_limit :: default :: 128 M
  • max_input_time :: default value : 60
  • session.gc_maxlifetime :: default : 1440

The time numbers are given in secs. The memory numbers in Megabytes.

Most of these parameters are mentioned in the following web articles which comment their impact on file uploads. Please, have a look at these web pages.
http://php.net/manual/en/features.file-upload.php
http://php.net/manual/de/features.file-upload.common-pitfalls.php

Important parameters for the transfer of big data files are “upload_max_filesize” and “post_max-size”. I want to stress the following point:

Both of these parameters have to be set consistently when dealing with file uploads. (Actually, in the past I sometimes forgot that myself and wasted some time with wondering why a file did not get loaded although I had set a sufficient value for upload_max_filesize).
 
When you upload a file plus some additional POST data the total amount of POST data can be bigger than just the file size. However, “if the size of post data is greater than post_max_size, the $_POST and $_FILES superglobals are empty” – according to the PHP manual. You may even get no warnings or errors in some situations.
Therefore, the value of “post_max_size” should always be bigger than the value of “upload_max_filesize” – and the latter should of course be at least as big or a bit bigger than the size of the file you plan to transfer to a server.

An important parameter is also the “max_execution_time”. It limits the time a PHP process is allowed to run. It should be big enough to cover the file handling and database import times on the server. Transfer times are not included if we believe the hints given at http://php.net/manual/de/features.file-upload.common-pitfalls.php.

In my understanding the “max_input_time” parameter limits the time for “parsing” request data (= POST or GET data). See http://php.net/manual/en/info.configuration.php#ini.max-input-time
However, a comment on the following page
http://php.net/manual/de/features.file-upload.common-pitfalls.php
says something different:

Warning – max_input_time sets the maximum time, in seconds, the script is allowed to receive input; this includes file uploads. For large or multiple files, or users on slower connections, the default of 60 seconds may be exceeded.”

Personally, I have some doubts about this last statement as I never experienced any problem with the standard settings and files above 200 MB. So a contrary position would be:

“Parsing” refers to the time interval between the arrival on the web server and before a PHP script starts executing – so “max_input_time” would not include the upload time (= transfer time to the server). However, it would include the time to prepare the superglobal arrays as $_GET, $_POST, $_FILE.

This interpretation makes some sense to me. However, I did not test it thoroughly.

Interestingly enough, there is some dispute about the meaning of the max_input_time parameter and its relation to the max_execution_time parameter on the internet. See:

http://blog.hqcodeshop.fi/archives/185-PHP-large-file-uploads.html
https://bugs.php.net/bug.php?id=53590&
https://bugs.php.net/bug.php?id=28572
http://stackoverflow.com/questions/11387113/php-file-upload-affected-or-not-by-max-input-time
http://www.php.de/php-fortgeschrittene/41473-php-ini-frage-zu-max_input_time.html
https://www.grumpyland.com/blog/101/settings-for-uploading-large-files-in-php/
http://www.devside.net/wamp-server/apache-and-php-limits-and-timeouts
http://www.techrepublic.com/article/a-tour-of-the-phpini-configuration-file-part-2/
http://serverfault.com/questions/224536/is-max-input-time-included-in-max-execution-time

I have no definite opinion about this discussion, yet. In two of the articles listed above one can see that even the error messages referring to conflicts with the max_input_time parameter can be misleading as they wrongly refer
to the max_input_time. Therefore, I recommend be aware of the parameter “max_input_time”, although the default should be sufficient for most cases. Setting it to “-1” corresponds to an unlimited max_input_time interval – whatever it really has an impact on.

If you really work with huge files of several hundred megabytes even the garbage collection time may become important. gc_maxlifetime sets a life time limit for data until the system regards them as garbage. So, if your import or data handling times get big enough also this parameter must be adjusted.

The “memory_limit” of a PHP-Scripts may also be reached when processing huge files.

So, you see there is a whole bunch of parameters which you may need to adjust when you start working with larger and larger files.

Web hosting providers and php.ini-settings

You find typical settings of php.ini-parameters for the web hosting packages of the providers Strato and 1&1 here.
http://strato-faq.de/article/1315/Mit-welchen-Grundeinstellungen-steht-PHP-bei-STRATO-zur-Verf%C3%BCgung.html
https://hilfe-center.1und1.de/skript–und-programmiersprachen-c82634/php-c82686/bedienung-c82739/welche-php-einstellungen-kann-ich-aendern-a791899.html

Most web providers also give more or less clear information about which of the PHP relevant parameters can be manipulated by customers (e.g. by the use of .htaccess- or directory specific php.ini-files) and which not. See e.g.:
https://hilfe-center.1und1.de/skript–und-programmiersprachen-c82634/php-c82686/bedienung-c82739/php-einstellungen-ueber-phpini-aendern-a791901.html

At some providers you can use a .htaccess file, for others you may need to put a php.ini file into each script directory. See e.g.:
http://www.webdecker.de/artikel/items/php-ini-value-htaccess.html

Posted in PHP

PHP/MySQL/Linux: File upload, database import and access right problems

On an (Apache) web server one should establish a policy regarding access rights to the files of hosted (virtual) domains. Especially with respect to files transferred by FTP or by uploads for a browser. It was an interesting experience for me that uploading files during an Ajax communication with a PHP program and moving them to target directories with PHP’s “move_uploaded_file” can collide with such policies and lead to unexpected results.

The problem

I recently tested Ajax controlled csv file transfers from a browser to a web server with a subsequent loading of the file contents to a database via PHP/MySQL. The database import was initiated by PHP by executing the SQL command “LOAD DATA INFILE” on the MySQL server. This chain of processes worked very well:

The uploaded csv file is moved from PHPs $_FILES superglobal array (which works as an input buffer for uploaded files) to a target directory on the web server by the means of the PHP function “move_uploaded_file”. My PHP program – the Ajax counterpart on the server – afterwards triggers a special MySQL loader procedure via the “LOAD DATA INFILE” command. MySQL then loads the data with very high speed into a specified database table. It is clear that the import requires sufficient database and table access rights which have to be specified by the PHP program when opening the database connection via one of PHP’s MySQL interfaces.

The overall success of the upload and database import sequence changed, however, in a surprising way when I wanted to transfer my (rather big) files in a zip-compressed form from the browser to the web server.

So I compressed my original csv file into a zip container file. This zip file was transferred to the server by using the same a web site formular and Ajax controls as before. On the server my PHP program went through the following steps to make the contents of the zip container available for further processing (as the import into the database) :

  • Step 1: I used “unlink” to delete any existing files in the target directory.
  • Step 2: I used “move_uploaded_file” to save the zip file into the usual target directory.
  • Step 3: I used the “ZipArchive” class and its methods from PHP to uncompress the zip-file content within the target directory.

Unfortunately, the “LOAD DATA INFILE” command failed under these conditions.

However, everything still worked well when I transferred uncompressed files. And even more astonishing:
style=”margin-left:20px;”>When I first uploaded the uncompressed version and then tried the same with the zip-version BUT omitted Step 1 above (i.e. did not delete the existing file in the target directory) “LOAD DATA” also worked perfectly.

It took me some time to find out what happened. The basic reason for this strange behavior was a peculiar way of how file ownership and access rights are handled by “move_uploaded_file” and by the method ZipArchive::extractTo. And in a wiggled way it was also due to some naivety on my side regarding my established access right policy on the server.

Note in addition: The failures took place although I used SGID and setfacl policies on the target directory (see below).

User right settings on my target directory

On some of my web servers – as the one used for the uploads – I restrict access rights to certain directories below the root directories of some virtual web domains to special user groups. One reason for this is the access control of different developer groups and FTP users. In addition I enforce automatic right and group settings for newly created files in these directories by ACL (setfacl) and SGID settings.

The Apache process owner becomes a member of the special group(s) owning these special directories. Which is natural as PHP has to work with the files.

Such an access policy was also established for the target directory of my file uploads. Let us say one of these special groups would be “devops” and our target directory would be “uploads”. Normally, when a user of “devops” (e.g. the Apache/PHP process)

  • creates a file in the directory “uploads”
  • or copies a file to the directory “uploads”

the file would get the group “devops” and “-rw-rw—-” rights – the latter due to my ACL settings.

What does “move_uploaded_file” do regarding file ownership and access rights ?

The first thing worthwhile to note is that the politics for what this PHP function does regarding ownership settings and access rights changed at some point in the past. It furthermore seems to depend on the fact whether the target directory resides on a different (mounted) file system. See the links given at the bottom of this file for more information.

Some years ago the right settings for the moved file in the target directory were “0600”. This obviously has changed :

  • The right settings of moved files today are “0644” (independent of file system questions).

However, what about the owner and the group of the moved file? Here “move_uploaded_file” seems to have it’s very own policy:

  • It sets the file owner to the owner of the Apache web server process (on my Opensuse Linux to “wwwrun“).
  • It sets the group always to “www” – and it does so
    independent of

    • whether the SGID sticky bit is set for the target directory or not (!) ,
    • whether the Apache web server process owner really is a member of “www” (!) ,
    • whether a file with the same name is already existing in the target directory with a maybe different group !

Funny, isn’t it? Test it out! I find the last 3 facts really surprising. They do not reflect a standard copy policy. As a result in my case the uploaded file always gets the following settings when saved to the file system by “move_uploaded_file”:

owner “wwwrun” and group “www” and the access rights “0644”.

So, after the transfer to the server, my target file ends up with being world readable!

What does ZipArchive do with respect to ownership and access rights ?

As far as I have tested “ZipArchive::extractTo” I dare say the following: It respects my sticky bit SGID and ACL settings. It behaves more or less like the Linux “cp” command would do.

So, when ZipArchive has done it’s extraction job the target file will have very different settings:

owner “wwwrun”, BUT group “devops” and the access rights “0660”.

However, this is only the case if the target file did not yet exist before the extraction !
If a file with the same name already existed in the target directory “ZipArchive::extractTo” respects the current owner/group and access rights settings (just as the cp command would have done!). Test it out!

The impact on MySQL’s “LOAD DATA INFILE …”

The MySQL process has its own owner – on my system “mysql”. When PHP issues the SQL command “LOAD DATA INFILE” by one of it’s MySQL interfaces the MySQL engine uses an internal procedure to access the (csv) file and loads it’s content into a database table. You may rightfully conclude that the opening of the file will be done by a process with “mysql” as the owner.

nSo, it becomes clear that not only the rights of the Apache/PHP process or database access rights are important in my scenario described above:

The MySQL (sub-) process itself must have access rights to the imported file!
 
But as we have learned: This is not automatically the case when the ZipArchive extraction has finalized – if “mysql” is not by accident or purpose a member of the group “devops”.

Now, take all the information given above together – and one understands the strange behavior:

  • When I uploaded the uncompressed file it got rights such that it was world readable and it’s contents could therefore be accessed by “mysql”. That guaranteed that “UPLOAD DATA INFILE” worked in the first standard scenario without a zip file.
  • If the target file is not deleted and exists before a zip extraction process it will be rewritten without a change of it’s properties. That makes the “LOAD DATA INFILE” work also after a zip extraction as long as we do not delete a previously existing target file from a standard upload with the same name and world readability.
  • However, in case of an emptied target directory the extraction process respects the SGID and ACL settings – and then “mysql” has no read access right for the file!

A conclusion is that I had stumbled into a problem which I had partially created by myself:

I had established a policy for group and right settings which was respected when extracting zipped files. This policy collided with required file access rights for the MySQL processes. Stupid me! Should have taken that into account !

Accidentally, my politics was not respected when uploading and handling standard files directly without encapsulating it in a zip container. Due to the failure of my policy and the disregard of special right settings by “move_uploaded_file” the subsequent MySQL process could use “LOAD UPLOAD INFILE”. Not what I wanted or had expected – but the policy violation almost went undetected due to the overall success.

Conclusions

I hope that the points discussed in this article have made it clear that the access rights of uploaded and/or zip-extracted files are something to take good care of on a Linux web server:

Do not rely on any expectations regarding of how PHP functions and objects may handle user/group ownerships and access rights! In case of file uploads (with and without zip containers) test it out, check the behavior of your programs and set the rights with available PHP file handling functionality explicitly to what to want to achieve.

The consequences of the peculiar rights settings of “move_uploaded_file” and it’s ignorance of SGID/ACL policies must be taken into account. Other PHP routines may handle access rights of copied/moved/extracted files differently than “move_uploaded_file”. Furthermore, if you want to load data from uploaded (csv) files into the database take into account that the “mysql” user needs read access rights for your file.

Solutions for my scenario with zip files

Let us assume that I follow my advice above and set the rights for files saved by “move_uploaded_file” explicitly, namely such that my policy with the group “devops” and the “0660” rights is respected. Than – without further measures – no uploaded file could be handled by the MySQL “UPLOAD DATA INFILE” mechanism. In my special case one could think about several possible solutions. I just name 3 of them:

Solution 1: If – for some reasons – you do not even temporarily want to break your SGID/ACL policies for the uploaded and extracted file in your target directory, you could make “mysql” a member of your special group (here “devops”). Note that this may also have some security
implications you should consider carefully.

Solution 2: You may set up another target directory with special access rights only to “mysql” and “wwwrun” by defining a proper group. This directory would only be used to manage the data load into the database. So, copy or move your uploaded/extracted file there, set proper rights if necessary and then start the “UPLOAD DATA” process for this target file. And clean the special directory afterwards – e.g. by moving your upload file elsewhere (for example to an archive).

Solution 3: You may change the access rights and/or the group of the file temporarily by the PHP program before initiating the MySQL “LOAD DATA” procedure. You may reset rights after the successful database import or you may even delete or move the file to another directory (by rename) – and adjust its rights there to whatever you want.

Whatever you do – continue having fun with Ajax and PHP controlled file uploads. I shall come back to this interesting subject in further articles.

Links

New access right settings of move_uploaded_file
https://blog.tigertech.net/posts/php-upload-permissions/
Old access right settings by move_uploaded_file and ignorance of SGID sticky bit
http://de1.php.net/manual/en/function.move-uploaded-file.php#85149
cp and access rights (especially when a file already exists in the target directory)
http://superuser.com/questions/409226/linux-permissions-and-owner-being-preserved-with-cp
The use of SGID
http://www.library.yale.edu/ wsg/docs/ permissions/ sgid.htm

 

PHP/OO/Schemata: Decoupling of property groups by using composite patterns

When designing and building web applications with OO techniques the following elementary questions come up:

  • How to represent real world objects [RWOs] in the PHP OO code of Web applications by PHP objects?
  • How and where to define and control hierarchical relations between different object classes?
  • How and where to define the relation of the object properties to database table and column definitions of a RDBMS?
  • How to define rules for presentation and web template controller objects and/or web generator objects ?

As a common answer I often use a kind of “SCHEMA” oriented approach. By a SCHEMA on the programming level I understand a container of information that helps to control objects of an application and their properties in a flexible, yet standardized way. So, I am not talking about database schemata – I refer to SCHEMATA describing and controlling the data structure, the behavior, properties and certain methods of OO objects (OO Schemata). Such an object control SCHEMA may, however, refer to and use information coded in a relational database schema. Actually, important parts of a SCHEMA support the objectives of a classical data mapper pattern (OO properties vs. database table fields). But SCHEMATA in my own application framework contain more and other logical control information and settings than just data mapping definitions (see below).

When many object relations and properties are involved the maintenance of multiple OO SCHEMATA may get painful. This article is about the idea that the use of decoupled Schemata for property groups and a rigorous use of composite patterns in the object design may make things easier.

In the following text we call a PHP object instance that represents a RWO a “PWO”. The PWO object properties are of course defined in the PWO class definition. What kinds of PWO classes are required is defined by the results of a object oriented analysis combined with an ER model. A specific PWO class may be derived from a general (framework) class defining “Single” [SGL] objects that shall work as PWOs (and represent a single RWO each) plus the methods required for handling a PWO and its property data.

Schemata to encapsulated structural data knowledge for (PHP) PWO objects

A PWO may represent a RWO of a certain type like a free position, a product, a contract, an employee or a web page. We need classes of PWOs to cover all required RWO types. For each class of PWOs that appear in a new PHP application I define a specific PHP SCHEMA object class (derived from Schema base classes). I use such a PWO Schema class to define – among other things – e.g.

  • all required properties of a PWO object class, associated data types, value ranges, default values, criteria for must and (un/)changeable properties by appropriate arrays,
  • schematic control parameters for other application objects that deal with the data,
  • associated database tables/views of a RDBMS,
  • the relation of properties to database fields and index definitions,
  • the relation of properties to $_POST oder $_SESSION data arrays,
  • standard or Null value settings for all fields,
  • typical SQL fragments for a variety of database interactions,
  • all required Master-Detail [MD] relations to other object classes and associated tables/columns,
  • possible constraints for deletions,
    inserts, updates to maintain data and relation integrity,
  • constraints regarding unique values within relations.

Most of the property/field information is typically kept in arrays. The definitions can be provided directly in the class definition field or indirectly by fetching information from special database tables which an application designer may fill. Normally, only one Schema object instance – a PWO class specific SCHEMA object – is generated for each type of PWOs during the run time of a PHP application program. SCHEMA objects of a specific class are typical singletons.

The ultimate goal of using Schema definitions in an application development framework is:

If you once have defined a Schema consistently with your database most – if not all – application features and methods to maintain the PWO objects and their underlying data should be ready to run. The object constructor uses the SCHEMA definitions to prepare and fill the objects properties.
In addition a SCHEMA provides a very flexible and dynamic way to change the property list of object classes.

As a central part of this task the Schema information relates (PHP) PWO object properties to database tables and their field/columns (of a RDBMS) in the sense of a “mapping“. Such tasks are theoretically well covered by the use of “Data Mapper” patterns as a bridge between the OO and the RDBMS worlds. A SCHEMA is – among other things – a concrete Data Mapper object. It defines what PWO property corresponds to which database table column. (It may also define the property and data type, value limits changeability, and so on). If you for a moment neglect object relations (and related table relations in the database), in very many simple applications RWO properties and related PWO properties are mapped to the fields of exactly one database table.

However, in case of complex applications you may also have to define parameters describing Master-Detail [MD] relations or other types of relations to other object classes or other external MD hierarchies. For complex applications we furthermore will add parameters defining which properties shall or shall not appear in which order, in which PWO single or PWO list representation, in MD views, in maintenance masks and so on.

Especially list definitions (What fields appear? From which joined tables? In which order?) and definitions for which fields should appear in which order in maintenance masks or public web pages add complexity to a SCHEMA. Moreover, for the purposes of an application meta framework, I provide a lot of parameters to control presentation tasks both for automatically providing maintenance masks for whole object hierarchies and controlling the methods of web page generator objects.

Therefore, in a complicated application environment, a SCHEMA object can be a very complex thing in itself – it must be modified with care and there should always be defined methods to guarantee consistency conditions – e.g.

  • internally between definitions inside a SCHEMA,
  • between SCHEMATA for object classes defined on different levels of a MD hierarchy
  • and of course between the SCHEMA definitions and the tables in the database.

Note that SCHEMA definitions in application developer meta frameworks can also be used to generate tables for new applications (if they are not present yet) and/or to change existing tables and/or to validate the real database tables against what the PHP object representation in a program expects.

A typical (web) application will of course require multiple distinct SCHEMATA for the definition of different PWO classes.

Typically the appropriate SCHEMA object for a PWO or an object describing lists of PWOs will be loaded or injected
into a PWO instance during the creation of this instance by a constructor function. [An alternative to an external injection into the constructor could of course be an automatic identification and loading of the already created singleton Schema object by static methods of a classical Singleton pattern in PHP.]

A major disadvantage of a Schema oriented approach is:

If you make errors when maintaining the Schema definitions your application(s) probably will not run anymore. The reason of course is that Schema objects transport a substantial, crucial knowledge about the PWO object(s) and the database structure as well as object relations and table relations. This knowledge is used at many places in your application methods.

Maintainability of SCHEMA definitions ?

The idea of encapsulating extensive knowledge about data structures in a SCHEMA may lead to maintainability problems. In my experience, a SCHEMA based approach works very well and relatively effortless if

  • the number of properties/fields is limited,
  • selection conditions for defining which fields appear are limited,
  • when sequence and ordering conditions (i.e. the order of fields or conditions on the order of field appearances in representation objects) are simple and also limited
  • the relations to other object classes are simple and the number of relations is limited
  • your application, object and and database design is finalized already.

But, whenever the number of properties of RWOs/PWOs gets big, or large groups of logically connected properties appear and/or the relations to other objects get numerous and complicated then the effort to change the settings in the interconnected SCHEMATA for several PWO classes may get a painful and a very time consuming task.

Especially the insertion of new object properties at some positions in defined sequence of properties may lead to a lot of manual adjustments a programmer may have to do in different affected Schemata. E.g., a renumbering of many array elements may happen. Changing the database tables may be much less of an effort than changing all affected Schemata in a complex application environment (to be developed).

In parallel to the number of properties the number of columns in a database table may get big – so that even handling the tables with tools like phpMyAdmin may also become a bit difficult. Note that in a CMS like application the variables (properties) describing the position and look of page elements may easily be more than hundred.

Typically, the last condition of my criteria list above will not be met during development phases – especially not in agile projects. Tests and new insights may lead to continuous modifications of data and object models. Resulting Schema adaptions to changes in database or object models may happen very often during some development phases. And then time is a more crucial factor than developers may wish it to be.

Some SCHEMA edit actions like renumbering of arrays or adjusting property/field order definitions will feel like a waste of time especially in cases when only a special group of related properties is the main target of the changes whilst other property groups could be left alone – at least in principle – if things were decoupled … . However, this is not the case if all properties of all property groups are sequentially listed and probably enumerated in one and the same SCHEMA object. And believe me – some enumeration has to be done – you cannot cover all and everything by associative arrays.

To enhance the maintainability of SCHEMATA in vast MD applications I have used a different approach for some time now,
which may also be of interest for other developers. Actually – as I am originally a physicist and programming is not even my main profession within IT – I am not sure whether the following approach has been described elsewhere. Probably it has. There is a big overlap with the well known general “Composite Pattern” in OO design – however, it is the relation to SCHEMA objects that is of interest here. My goal is to fold a composite pattern for SCHEMATA into a related composite pattern for PWOs. For myself it marked a bit of a change of how I incorporate structural information into complex PHP/MySQL applications like a CMS.

SLAVE Schemata and SLAVE Objects

I summarize my approach which leads to a split of tables and Schema definitions by the words “Slave Schemata and Slave Objects”.

By the word “SLAVE” I do not refer to Master-Detail relations. I neither refer to Master-Slave structures in a database or LDAP server environment. To mark the difference I shall below call the “master” of a SLAVE SCHEMA the “MAIN SCHEMA“. Nevertheless SLAVEs mark a level in a new hierarchy of a composite pattern as I explain below.

I have three main objectives:

  • The first objective of this approach is to decouple the conventional definitions of property and data groups in a classical Schema class from each other and encapsulate each data group definition in a separate SCHEMA object, i.e. in a separate SLAVE SCHEMA class definition.
  • The second objective is that – instead of comprising all data fields for object properties in just one database table – we will distribute the fields over several separate database tables – one for each group of properties.
  • The third objective is to make the resulting MAIN/SLAVE-SCHEMA structure and the distribution of data over several tables usable for (restructured) PWOs – without having to reprogram basic methods already available to PWOs by inheritance from some Base classes (of a framework). This will lead us to the definition of SLAVE objects.

SLAVE SCHEMATA and distinct database tables for property groups

Please note, that the rearrangement of data over several database tables is NOT done because of reasons like redundancy reduction or to get a better ER model. On the contrary, we shall need additional efforts in our object methods to gather all property information from separate tables. The whole effort is done to enhance the maintainability of our SCHEMATA. Nevertheless, reflecting the logical association of data groups by separate distinct tables may help to deal better with associative structures – although we even get a bit more redundancy.

At its core a SCHEMA defines the relation between database table fields and PWO properties. If we want to split the properties of PWOs of a defined class into groups and distribute these groups into separate database tables we need of course multiple SCHEMATA and a related definition of several SCHEMA classes.

As an example let us assume that we have a RWO/PWO (like an art picture) which shall be described

  • by some basic geometrical information (as e.g. an art picture canvas) (property group “geo”)
  • and by some standard maintainable CMS parameters determining the presentation structure of text information and illustrating pictures on a web page (property group “cms”).

Then we may identify two groups of properties “geo” and “cms” of such a PWO. We could define the properties and their mapping to database fields in 2 Schemata – one for the “geo” group of object properties and one for the “cms” group.
r

However, we also need a kind of “Main Schema” to bind these (Sub-) Schemata together. Following this idea we get a new hierarchy – in addition to a potentially already existing logical and hierarchical Master-Detail relation between different classes of PWO objects (e.g. a picture-artist relation). But this time we deal just with a grouping of data in “rooms” of a house under one “roof”. So, this new hierarchy of SCHEMATA only has two levels:

  • A MAIN SCHEMA – which defines as usual basic properties of a specific PWO object class and the MD hierarchy relations or other relations to object instances of other PWO classes (besides other things)
  • Multiple SLAVE SCHEMATA – each describing a special group of semantically connected properties of its related specific PWO class.

Note that we will define basic logical and fundamental relational aspects of a PWO in the MAIN SCHEMA. A SLAVE SCHEMA contains information about some standard properties confined in a group. Each SLAVE SCHEMA describes the relation of the properties of its associated group with the fields of a distinguished SLAVE database table – separate from the MAIN table and other SLAVE tables.

All properties described in the MAIN SCHEMA object and it’s included SLAVE SCHEMA objects together define a complete set of properties of a PWO instance.

To logically bind associated records together it is clear that the key values identifying the associated records both in the Main table and the SLAVE tables must have a common value – identifying a PWO record and its respective object instance in a PHP program. We call this value of the unique record identification key “snr” below.

The MAIN SCHEMA has of course to comprise some knowledge about the SLAVE Schemata. In an MD application, we may in addition have to define one MAIN Schema and several SLAVE Schemata on each of the MD levels.

The basic idea regarding an improved efficiency of the maintenance of SCHEMATA is:

If you have to change some properties/fields – just and only change the affected SLAVE or the Main SCHEMA describing the related group of fields – but leave all other SCHEMATA and their definition statements unchanged !

SLAVE SCHEMA objects are created inside the MAIN SCHEMA object

As a Main Schema and its Slave Schemata strongly belong together and all the Slave Schemata depend on the Main Schema my approach was to create the SLAVE Schema object instances as sub-objects of the MAIN Schema object instance. I.e., I followed the idea of a composite pattern:

The MAIN SCHEMA object – a Singleton – acts as a container object for its SLAVE SCHEMA object instances. It generates, contains and controls SLAVE SCHEMATA as sub-objects (i.e. as complicated variables). We can save this knowledge in array-like structures for the sub-objects. The SLAVE Schemata can e.g be arranged in an associative array with indices defined in an array “ay_slave_prefs[]” of the Main Schema – containing name prefixes for each of the (Slave) property groups (e.g. “geo” and “cms”).

The MAIN SCHEMA objects of an MD application and their encapsulated SLAVE SCHEMA objects should of course be instances of the same type of general Base Classes for Schema objects. We want to use as much of the methods already defined for Schema objects as possible. SLAVE Schemata are basically Schemata after all!

Nevertheless, in such an approach we would still have to adapt or introduce some methods to deal with the hierarchical structure and the association with another – e.g. in special requirements for consistency checks between a SLAVE Schema and its MAIN Schema and other things. However, this is an easy task.

nFurthermore, each SLAVE SCHEMA object should to receive a reference to its MAIN SCHEMA object as an injected parameter to be able to create and handle further references to all the variables of the MAIN Schema. So the MAIN Schema object will contain SLAVE Schema objects – each of which itself comprises a reference to their common container object, namely the MAIN SCHEMA object.

Note, that there is still only one MAIN SCHEMA object comprising all relevant property and relation information for each PWO class.

The following drawing should make the basic ideas described above clear:

Slave_schemata

SGL PWO objects representing a single RWO become internally structured by SLAVE PWO objects

Ok, so far we grasped the idea of a kind of an array of SLAVE Schemata within a MAIN SCHEMA. Each SLAVE SCHEMA describes a bunch of properties of a PWO object class. The values of these properties are saved in a distinct database table. The MAIN SCHEMA keeps everything together and defines also the MD relations and other relations of a PWOs of different PWO classes.

At the core of PHP (web) applications, however, we need (structured) PWO objects with methods to handle all of the user’s interactions with all property data.

In my application meta framework a PWO class is realized by deriving it from a base class for so called Single [SGL] objects. The task of a SGL object class is to provide a general basis for specific PWO classes. A PWO instance is also a SGL instance and has all of its methods available. We speak of a “SGL PWO” object below. (An application class family comprises also other types of objects as e.g. LIST objects or template control objects).

A SGL PWO object is derived from some SGL Base Classes (of an inheritance chain) with all required methods e.g. to handle database transactions, to check field contents and to handle complex object relations like MD hierarchy relations of a potential PWO. It does this by extensively using the PWO SCHEMA information in its methods.

But, how to deal with our new type of a MAIN SCHEMA that contains information distributed over several SLAVE SCHEMATA? How would a PWO use it? And:

Do we need to rewrite all the base class methods for SGL PWOs to handle the database interactions because we have divided the PWO properties into distinct groups (saved in distinct database tables).
Fortunately, the answer is NO!

A SGL PWO object in my framework e.g. identifies its appropriate SCHEMA object by following name space rules and then loads the Schema as a sub object (see the graphics). It does this by injecting the PWO Schema object into the constructor of its most elementary base class in the inheritance chain. So, basically a PWO gets its Schema object injected. (I have only automatized this process by name rules and the use of Singleton patterns). A PWO deals with its data by using the knowledge of its injected SCHEMA.

Therefore, we can choose a similar approach for our SGL PWO objects as for the Schemata:

A SGL object making a specific PWO instance becomes a MAIN SGL PWO object. It will create and contain SLAVE SGL PWOs which are derived from the very same base classes as the MAIN SGL object itself. So, we use a kind of composite pattern also for the SGL PWO object:
 
The Main SGL PWO object acts as a container for the SLAVE SGL PWOs. Each SLAVE SGL object shall be responsible for the representation of a property group defined in a SLAVE Schema. And now comes the real power of
this intertwined double composite pattern approach:
 
The properties of each SLAVE SGL object correspond to fields of the table/views defined in the related SLAVE Schema, only! To use that on the programming stage we only have to create the SLAVE SGL PWO objects the same way as the MAIN SGL PWO object – BUT with an injection of the relevant SLAVE Schema instead of the Main Schema!

See the drawing above. As in the case of the SCHEMATA we create each SLAVE SGL object with a reference to its MAIN SGL object. Each SLAVE SGL object therefore knows about the identification key of the MAIN SGL object (and its rows in the database tables) and can therefore use it to identify its own records in the SLAVE tables of the database (defined in the SLAVE Schema). Remember that we defined the key value to be the same for associated records in the MAIN and SLAVE tables.

Provided that the right SLAVE SCHEMA was injected into each SLAVE SGL PWO , all base class methods that work for the MAIN SGL object regarding database operations will also work for SLAVE SGL objects. The correct identification of the right record in the associated SLAVE tables is guaranteed if each SLAVE object gets the same value “snr” for its identification key as its MAIN container object (see above). That should be straightforward to understand and can be guaranteed by the constructor functions. As a result SLAVE objects and their methods work autonomously on their SLAVE tables just as the MAIN SGL object works on its main table.

Recursive iteration of methods

All what we said above means in addition that we are able to iterate all the methods, a SGL MAIN object uses for data handling, also over it’s SLAVE SGL objects:

The thing to be guaranteed is that every UPDATE, DELETE, INSERT method for MAIN SGL object automatically triggers the execution of the very same (base class) methods for the SLAVE objects. This requires a rather simple method extension in the base classes. Actually, we could define each of the elementary methods in form of a recursion following a hierarchy:

If SLAVE objects exist call the method presently used for the MAIN SGL PWO object for each of the SLAVE objects, too.

We may stop the recursion at the level of the SLAVE objects without trying a further iterate over non existing deeper SLAVE level by evaluating some special property of a SLAVE Schema describing that the SLAVE has no SLAVE Schemata incorporated itself. (However, also the iteration over tree like object structures would be possible – although not required in my case.)

Remark regarding error handling of database transactions

For error handling we need further method extensions controlling the success of the database operations over all (MAIN and SLAVE) tables. In case of failures in a SLAVE table all other already performed transactions on other SAVE tables or the MAIN table have to be rolled back. To do this without appropriate mechanisms offered by the database the old values have to be saved in an intermediate storage. Otherwise database transaction control and rollback mechanisms could be used.

Hide or use the SLAVE structure from outside?

Note that regarding the interaction with a PWO from outside you have the choice

  • to adapt objects that use PWOs to work with their data or generate e.g. web pages to use they knowledge about the SLAVE structure
  • or to create interface methods that hide the internal structure.

My experience is that in case you build your own application development framework you will need both approaches. Note also that our SLAVE
approach will have an impact on objects and methods developed for representing LISTs of PWOs or MD views over several MD hierarchy levels of PWOs. But that is stuff for another article. In this article I just wanted to present the basics of the SLAVE SCHEMA and SLAVE PWO object approach.

Conclusion

Maintaining multiple SCHEMA definitions over hundreds of properties of RWO/PWOs can be dreadful. Splitting the properties into property groups and defining associated Sub-Schema-objects of a Main-Schema-object in the sense of a OO composite pattern can help to improve maintainability. This approach can be coupled with a composite pattern for the (SGL) PWO objects representing single RWOs. The SLAVE (SGL) PWO objects will be instantiated by using the same SGL base classes as for the MAIN PWO object (containing the SLAVE PWOs). The decoupling of data is guaranteed by the injection of right SLAVE Schema into a SLAVE PWO. Many methods can then be iterated over the Main/SLAVE objects and will then lead to a consistent database interaction.

In a forthcoming article I shall discuss the impact of a SLAVE object approach on (X)HTML generator methods in web applications. See:
PHP/OO/Schemata: SLAVE objects and (X)HTML generator methods

Posted in PHP

MySQL/PHP: LOAD DATA – import of large csv files – linearity with record number?

In my last article
Importing large csv files with PHP into a MySQL MyISAM table
I talked a little about my first positive experience with using the “LOAD DATA INFILE …” SQL statement in PHP programs to load csv files with millions of records. Some reader asked me whether the import period behaves linearly with the number of records. Actually, I am not completely sure. The reader has to wait for some more experiments on my side. Right now we are preparing tests for 1, 2, 5 and 10 million line csv files for a certain project I am involved in.

So far, we have passed tests with sequences of 1, 2 and 5 million line (= record) files. For those our results look pretty linear:

  • For 1 million records we needed ca. 7,5 seconds to import them into a MySQL MyISAM table (with index building).
  • For 2 million records we needed ca. 15 seconds.
  • For 5 million records we needed ca. 38 seconds.

The csv data files were originally 32 MB, 64 MB and 165 MB big. We transferred them as compressed zip-files to the server (size reduction by a factor of 4 in our case).

Addendum, 19.09.2014, regarding the absolute numbers for the import time:

I should have mentioned that the absolute numbers for loading times are a bit large. The reason for this is that some indices are defined on the table and that the index trees are built up during the loading process. Actually, one can load our 5 million records in below 6 seconds if no indices are defined on the table. Furthermore, one should separate index building completely from the loading process. Building the index in one step after all data have been loaded may take significantly less time than building it during the loading process – especially if non-unique indices are involved. See a forthcoming article
MySQL: LOAD DATA INFILE, csv-files and index creation for big data tables
about this interesting topic.

Furthermore, the measured time intervals included the following main actions

  • jQuery/PHP: transfer of a zip file from a control web page to the server via Ajax in a local network
  • PHP: saving of the transferred on a special directory on the server
  • PHP: unzipping the file there
  • PHP/MySQL: selecting information from control tables (control parameter)
  • PHP: checking present status of several target tables by various selects (select for certain consistency numbers not only from the filled table)
  • PHP/jQuery: return of an Ajax message to the browser confirming the successful file transfer to the server
  • jQuery/PHP: automatic start of a new Ajax transaction from the client with control parameters for the import and subsequent checks
  • PHP/MySQL: truncation of existing import table and dropping of views
  • PHP/MySQL: importing the data by a “LOAD DATA INFILE … ” SQL statement issued by the PHP program
  • PHP/MySQL: creating (3) views
  • PHP/MySQL: checking present status of several target tables by various selects (consistency numbers)
  • PHP/jQuery: Return of an Ajax message to the client – confirming the successful import and delivering consistency data

So, actually it is a bit difficult to
extract the real required time for the file import by “LOAD DATA” from this bunch of very different activities. However, the whole sequence should behave approximately linearly if everything works correctly. (If you wonder why we have this two fold Ajax transaction structure – wait for more articles in this blog to come.)

The rather linear behavior for our 1 to 2 million files is also confirmed by the following graphs

Graph for 1 million records

file_upload_1mio_mysql_cut

Graph for 2 million records

file_upload_2mio_mysql_cut

Graph for 5 million records

file_upload_5mio_mysql_cut

You can see the two fold SELECT structure resulting form the action sequence described above.

Please, note the systematic rise of intermediately used memory from around 700 MByte up to 1.3 GByte! I conclude that the linear behavior is valid only as long as the system has enough free RAM to deal with the data.

I shall supplement more measurement data when we deal with larger files. Our hope is that we can confirm a more or less linear behavior also for larger import files with more records (provided that we supply our virtualized server system with sufficient RAM). However, you can never be sure – when the files get really huge some other memory limitations somewhere on the server (by system or service configuration parameters) may reduce the loading performance drastically. We shall see …