Import Processes

General thoughts

There are several ways to put data into Magento. One of them is Magento's built-in flat-file import. Another is the WebAPI, which also supports bulk and asynchronous operations. So why do we need Pacemaker?

The existing solutions have their pros and cons, and so does Pacemaker. Let's compare them.

Magento’s Flat File Import

Magento ships with a flat-file import. The file is uploaded via the admin UI, and a synchronous process validates and persists the data from the uploaded CSV file. Compared to the WebAPI, this approach can handle large amounts of data fairly quickly, but you need to configure your web server so that it does not abort the long-running request and accepts large POST bodies.

The import is executed immediately. Depending on the load on your server, this can cause performance issues (how severe depends on your infrastructure). If the indexer mode is "on schedule", the import triggers a re-indexing, which also consumes server resources; with the "on save" mode, a manual re-indexing is required after the import. The re-indexing in turn invalidates caches. All of these resource-consuming processes should happen in a controlled way and within a controlled time frame.

Magento’s WebAPI

The WebAPI is a good way to read, add and update data in Magento, but it is not designed to handle huge amounts of data. The bulk and asynchronous WebAPI can cope with large payloads, but not very performantly: depending on your infrastructure, the update of a single product can take 1-3 seconds. For a catalog of 20k products this adds up to a processing time between roughly 5 and 17 hours, and we often face shops with several million products.
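As a rough sanity check, the totals above follow from a simple back-of-the-envelope calculation (illustrative only; the per-product time is an assumption that depends entirely on your infrastructure):

    # Back-of-the-envelope estimate for sequential WebAPI product updates.
    # The per-product times are assumptions, not benchmarks.
    products = 20_000
    for seconds_per_product in (1, 3):
        hours = products * seconds_per_product / 3600
        print(f"{seconds_per_product} s per product -> ~{hours:.1f} hours")
    # 1 s per product -> ~5.6 hours
    # 3 s per product -> ~16.7 hours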

The Pacemaker Import Approach

The basic idea behind Pacemaker is to decouple third-party systems and user interactions from the resource-consuming processes on the Magento side. To achieve this, we use Process Pipelines. To handle huge amounts of data, we use M2IF, a powerful import framework that is many times faster than Magento's flat-file import.

But an import process is more than just putting data into the database. The pipelines also take care of fetching, transforming and persisting the data, as well as of post-import processes such as re-indexing and cache invalidation. All these operations are parts of one big process chain and can be executed synchronously or asynchronously, depending on your current system state.

Common Process Design

Usually, import processes are designed as follows (a minimal sketch of this chain follows the list):

  1. Trigger the import process chain
    e.g. periodically (daily), via web service notification, by observing the file system, etc.

  2. Fetch data from the source
    e.g. reading files, calling WebAPIs, etc.

  3. Transform the data into the target format
    e.g. executing your own scripts, using external libraries, etc.

  4. Execute the import
    e.g. running the M2IF library, etc.

  5. Run indexers and invalidate caches
    e.g. using Magento’s APIs
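
The following minimal sketch shows this chain as plain sequential steps. All function names are hypothetical placeholders for the steps listed above, not Pacemaker or M2IF APIs:

    # Minimal sketch of the generic import chain; the function names are
    # hypothetical placeholders, not Pacemaker or M2IF APIs.

    def fetch_source():
        # step 2: e.g. read files from a drop folder or call a WebAPI
        return ["raw record 1", "raw record 2"]

    def transform(records):
        # step 3: e.g. map the foreign format to the target import format
        return [record.upper() for record in records]

    def execute_import(records):
        # step 4: e.g. hand the prepared data over to the import library
        print(f"importing {len(records)} records")

    def post_process():
        # step 5: e.g. trigger re-indexing and cache invalidation
        print("re-indexing and invalidating caches")

    def run_pipeline():
        # step 1 (the trigger) would normally be a cron job, a web service
        # notification or a filesystem observer calling this function
        execute_import(transform(fetch_source()))
        post_process()

    run_pipeline()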

By default, Pacemaker’s import pipelines provide an observer for the local filesystem, which triggers the pipeline initialization. The transformation step is meant for customization, since it depends on your data source whether the files need to be transformed or not; please refer to How to extend > Transform foreign import source to learn how to customize this step. By default, the M2IF library runs the import, and there are executors for Magento’s indexers and cache invalidation.

Import Files Observer

Since Pacemaker 1.2, the observation of the filesystem is no longer designed as a pipeline of its own; it is part of the heartbeat process now. There it uses the techdivision/pacemaker-pipeline-initializer package to trigger the import pipelines once the required files are present in the file system. Please refer to Components & Concepts > Pipeline Initializer for details.
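
For illustration only, the following sketch shows the general idea of such an observer: poll a directory and treat a bunch as ready as soon as its .ok flag file shows up. The directory path and polling logic are assumptions, not the actual pacemaker-pipeline-initializer implementation:

    # Simplified stand-in for a filesystem observer; the directory path and
    # polling logic are assumptions, not the pacemaker-pipeline-initializer code.
    import glob
    import os
    import time

    IMPORT_DIR = "var/pacemaker/import"  # hypothetical drop folder

    def ready_bunches(directory):
        # a bunch is considered complete once its .ok flag file is present
        return [os.path.basename(path) for path in glob.glob(os.path.join(directory, "*.ok"))]

    def observe(directory, interval_seconds=30):
        while True:
            for flag_file in ready_bunches(directory):
                print(f"would trigger an import pipeline for {flag_file}")
            time.sleep(interval_seconds)

    observe(IMPORT_DIR)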

What is a file bunch?

Since Pacemaker uses M2IF, it is possible to split an import into multiple files. And because Pacemaker runs the attribute-set, attribute, category and product imports in one pipeline, a bunch can grow to a large number of files. All these files need the same identifier in their file names. This identifier is defined in the File Name Pattern configuration by this part of the regular expression: (?P<identifier>[0-9a-z\-]*).

According to the default expression, the file names need to follow this pattern: <IMPORT_TYPE>-import_<BUNCH_IDENTIFIER>_<COUNTER>.<SUFFIX>. Example files are provided in the Pacemaker packages; please refer to Run your first predefined import jobs. Of course, you can change the expression if necessary, just take care to define an identifier within the pattern.
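
To illustrate how the identifier groups files, here is a small sketch that parses file names following the documented pattern. Only the (?P<identifier>...) group is taken from the default configuration; the rest of the expression is an assumption for demonstration purposes:

    # Sketch: parse file names following the documented pattern
    # <IMPORT_TYPE>-import_<BUNCH_IDENTIFIER>_<COUNTER>.<SUFFIX>.
    # Only the (?P<identifier>...) group comes from the default configuration;
    # the rest of the expression is an assumption. The .ok flag files omit
    # the counter part and are therefore not covered here.
    import re

    PATTERN = re.compile(
        r"^(?P<type>[a-z\-]+)-import_(?P<identifier>[0-9a-z\-]*)_(?P<counter>\d+)\.(?P<suffix>[a-z]+)$"
    )

    match = PATTERN.match("product-import_20190627-1_02.csv")
    if match:
        print(match.group("type"))        # product
        print(match.group("identifier"))  # 20190627-1
        print(match.group("counter"))     # 02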

Examples

The following files would result in one import pipeline, because the identifier is the same for all files. Only the attribute and product import steps would be executed; the attribute-set and category imports would be skipped because no files are given for them.

- attribute-import_20190627_01.csv
- attribute-import_20190627.ok
- product-import_20190627_01.csv
- product-import_20190627_02.csv
- product-import_20190627_03.csv
- product-import_20190627.ok

The following files would result in two import pipelines: the first bunch imports all entities, while the second bunch imports only product data (see the grouping sketch after the list).

- attribute-set-import_20190627-1_01.csv
- attribute-set-import_20190627-1.ok
- attribute-import_20190627-1_01.csv
- attribute-import_20190627-1.ok
- category-import_20190627-1_01.csv
- category-import_20190627-1.ok
- product-import_20190627-1_01.csv
- product-import_20190627-1_02.csv
- product-import_20190627-1_03.csv
- product-import_20190627-1.ok
- product-import_20190627-2_01.csv
- product-import_20190627-2_02.csv
- product-import_20190627-2_03.csv
- product-import_20190627-2.ok
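
A quick way to see how these file names fall into two bunches is to group them by that identifier. The sketch below reuses the assumed expression from above and only looks at the CSV files:

    # Group the example CSV file names from above into bunches by identifier.
    # Reuses the assumed file name expression from the previous sketch.
    import re
    from collections import defaultdict

    PATTERN = re.compile(r"^(?P<type>[a-z\-]+)-import_(?P<identifier>[0-9a-z\-]*)_\d+\.csv$")

    files = [
        "attribute-set-import_20190627-1_01.csv",
        "attribute-import_20190627-1_01.csv",
        "category-import_20190627-1_01.csv",
        "product-import_20190627-1_01.csv",
        "product-import_20190627-1_02.csv",
        "product-import_20190627-1_03.csv",
        "product-import_20190627-2_01.csv",
        "product-import_20190627-2_02.csv",
        "product-import_20190627-2_03.csv",
    ]

    bunches = defaultdict(set)
    for name in files:
        match = PATTERN.match(name)
        if match:
            bunches[match.group("identifier")].add(match.group("type"))

    for identifier, types in sorted(bunches.items()):
        print(identifier, sorted(types))
    # 20190627-1 ['attribute', 'attribute-set', 'category', 'product']
    # 20190627-2 ['product']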