Data Importers

A frequent need when integrating systems is to import (or export, depending on your perspective) data from one system to another. Rattail provides a framework for this, which offers the following benefits:

  • “dry run” mode to check things out before committing changes
  • “warnings” mode which sends email with data diffs, e.g. when you expect no changes
  • adjustable “batch size” for grouping changes when submitting to local system
  • full command line support for above, plus “max” changes to apply, show progress, etc.
  • core code is optimized to run quickly, e.g. by fetching all data up-front
  • new importers may be created simply / cleanly / according to existing patterns
  • new importers may extend / replace core functionality as needed
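To make the “batch size” and “max” ideas a bit more concrete, here is a minimal sketch in plain Python. It does not use Rattail at all, and the names batched_changes, batch_size and max_total are hypothetical, chosen only for illustration; the framework's own implementation may look quite different.

```python
from itertools import islice


def batched_changes(changes, batch_size, max_total=None):
    """Yield lists of at most batch_size changes, stopping after max_total."""
    if max_total is not None:
        # Cap the overall number of changes which will be applied.
        changes = islice(changes, max_total)
    iterator = iter(changes)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


pending = [f"change-{n}" for n in range(7)]
for batch in batched_changes(pending, batch_size=3, max_total=5):
    # A real importer would submit / flush this batch to the local system here.
    print(len(batch), batch)
```

Grouping changes this way lets the local system be hit once per batch instead of once per record, while the cap keeps a first run from changing more than you are ready to review.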

The rest of this document aims to explain the concepts and patterns involved with the Rattail importer framework.

Todo

Add link for code / API docs here.

“Importer” vs. “DataSync”

Perhaps the first thing to clear up is that Rattail also has a “datasync” framework, tasked with keeping systems in sync in real time, whereas the “importer” framework is tasked with a “full” sync of two systems. In other words, datasync normally deals with only one (e.g. changed) “host” object at a time and updates the “local” system accordingly, whereas an importer examines all host objects and updates the local system accordingly. Also, datasync normally runs as a proper daemon, whereas an importer normally runs either as a cron job or in response to a user request via command line, web UI, etc.

Todo

Write datasync docs / link here.

To make things even more confusing, datasync can leverage an import handler / importer(s) where possible so that the same logic is executed for both “real-time sync” and “full sync” modes.
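The idea of sharing one piece of logic between “real-time sync” and “full sync” can be sketched as follows. This is plain Python with hypothetical names (sync_object, on_host_change, full_import) and plain dicts standing in for the two systems; it is only meant to show the shape of the idea, not how Rattail actually wires datasync to its importers.

```python
def sync_object(host_obj, local_data):
    """Bring one host object into local_data; return True if anything changed."""
    key = host_obj['id']
    if local_data.get(key) == host_obj:
        return False
    local_data[key] = dict(host_obj)
    return True


def on_host_change(host_obj, local_data):
    """Datasync-style entry point: handle a single changed host object."""
    return sync_object(host_obj, local_data)


def full_import(host_objects, local_data):
    """Importer-style entry point: examine every host object."""
    return sum(sync_object(obj, local_data) for obj in host_objects)


host = [{'id': 1, 'name': 'Grocery'}, {'id': 2, 'name': 'Produce'}]
local = {1: {'id': 1, 'name': 'Grocery'}}

print(full_import(host, local))     # number of objects changed by the full pass
print(on_host_change({'id': 2, 'name': 'Fresh Produce'}, local))
```

Because both entry points funnel into the same function, a rule added there applies equally to the daemon and to the cron job.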

“Host” vs. “Local” Systems

From the framework’s perspective, all import tasks have two “systems” involved: one is dubbed “host” and refers to the source of external/new data; the other is dubbed “local” and refers to the target where existing data is to be changed. It is important to understand what “host” and “local” refer to as you will encounter those terms frequently in the documentation (and code).

Note that it is perfectly fine for the same “system” proper to be used as both host and local within a given importer. That is, you can read some data from one system and then write data changes back to the same system. This can be useful for applying business rules logic to “core” (e.g. customer) records as an asynchronous process after they are changed normally within the UI or as part of EOD etc. The typical use, of course, is for the host and local systems to be actually different systems.

The term “system” here doesn’t imply a database or anything in particular, really. All that is required of a “host system” is that it be able to provide data for the import; all that is required of a “local system” is that it be able to provide “corresponding data” (i.e. for comparison, to determine if an add/update/delete is needed) and/or be able to apply add/update/delete operations as requested. Therefore in practice either the “host” or “local” system may be a database, web API, Excel spreadsheet, flat text file, etc.
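The comparison step mentioned above (deciding whether an add/update/delete is needed) might be sketched like this. It is a simplified illustration in plain Python: plan_changes and its arguments are hypothetical names, records are plain dicts, and the real framework's comparison logic is assumed to be more involved.

```python
def plan_changes(host_records, local_records, key, fields):
    """Compare host and local records; return (creates, updates, deletes)."""
    host_by_key = {rec[key]: rec for rec in host_records}
    local_by_key = {rec[key]: rec for rec in local_records}

    # Host records with no local counterpart must be added.
    creates = [rec for k, rec in host_by_key.items() if k not in local_by_key]
    # Records on both sides are updated only if a compared field differs.
    updates = [
        rec for k, rec in host_by_key.items()
        if k in local_by_key
        and any(rec[f] != local_by_key[k][f] for f in fields)
    ]
    # Local records which the host no longer provides are delete candidates.
    deletes = [rec for k, rec in local_by_key.items() if k not in host_by_key]
    return creates, updates, deletes


host = [
    {'number': 1, 'name': 'Grocery'},
    {'number': 2, 'name': 'Produce'},
]
local = [
    {'number': 2, 'name': 'Prod.'},
    {'number': 3, 'name': 'Deli'},
]
creates, updates, deletes = plan_changes(host, local, 'number', ['name'])
print(creates)
print(updates)
print(deletes)
```

Nothing here cares where the records came from, which is why either side can be a database, a web API, a spreadsheet or a flat file.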

Also, data does not always flow strictly host -> local. For instance, it is sometimes necessary to change the “host” system to reflect changes which were made in the “local” system (e.g. mark a host record as exported). The typical scenario, of course, is for only the “local” system to be changed.

Since all importers have this “host -> local” pattern, on the code level it is almost always the case that an importer will inherit from two base classes, one for the host side and another for the local. More on that later though.

“Importer” vs. “Import Handler”

Another important distinction within the framework itself is that of the “importer” vs. the “import handler”. Technically, a single Importer contains the logic for reading data from the host and for reading/changing data on the local system, but it is specific to a single “data model” (e.g. the products table), whereas an ImportHandler contains the logic for the overall transaction (i.e. commit/rollback). Therefore a single import handler might “handle” multiple importers, e.g. one each for products, customers, etc., so that multiple data models might be updated within a single transaction.

Note however that even within these docs you will find the term “importer” used more often, sometimes in a generic sense which refers to the overall importer concept / framework / implementation. Where the distinction matters, these docs try to make it explicit.

Also note that in practice, the “handler” abstraction layer is not always strictly necessary. For instance, you might need an importer to push new customer email addresses to an online mailing list, and it may have to use a web API which only supports one add per call. In other words you have only one “data model” to update, so you don’t need a handler to manage multiple importers, and the web API doesn’t support the commit/rollback approach because each submitted change is committed at once. However the suggested approach is to stick with established patterns and use a handler; various other parts of the Rattail framework (command line, datasync) will expect one.
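The division of labor described in this section can be sketched with nothing but the standard library. In the example below, SketchImporter and SketchHandler are hypothetical stand-ins (not Rattail classes), and an in-memory SQLite database plays the “local” system: each importer writes to one table, while the handler alone decides whether the whole transaction is committed or rolled back, which is also one way a “dry run” could be implemented.

```python
import sqlite3


class SketchImporter:
    """Hypothetical importer: writes rows for a single data model (table)."""

    def __init__(self, table, names):
        self.table = table
        self.names = names

    def run(self, conn):
        conn.executemany(
            f"INSERT INTO {self.table} (name) VALUES (?)",
            [(name,) for name in self.names],
        )
        return len(self.names)


class SketchHandler:
    """Hypothetical handler: owns the transaction around several importers."""

    def __init__(self, conn, importers):
        self.conn = conn
        self.importers = importers

    def run(self, dry_run=False):
        try:
            counts = {imp.table: imp.run(self.conn) for imp in self.importers}
        except Exception:
            self.conn.rollback()    # any failure undoes every importer's work
            raise
        if dry_run:
            self.conn.rollback()    # everything ran, but nothing is kept
        else:
            self.conn.commit()
        return counts


def count_rows(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT)")
conn.execute("CREATE TABLE customer (name TEXT)")
conn.commit()

handler = SketchHandler(conn, [
    SketchImporter("product", ["Apple", "Pear"]),
    SketchImporter("customer", ["Alice"]),
])

handler.run(dry_run=True)
print(count_rows(conn, "product"))
handler.run()
print(count_rows(conn, "product"), count_rows(conn, "customer"))
```

Keeping commit/rollback out of the importers is what allows several data models to succeed or fail together.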

Making a new Importer

Okay then, you must be serious if you made it this far…

First step of course will be to identify the “host” and “local” systems for your particular scenario. For the sake of a simple example here we’ll assume you wish to import product data from your “host” point of sale system (named “MyPOS” within these docs) to your “local” Rattail system.

Note also that to make a new importer, you must have already started a project based on Rattail; this doc will not explain that process. The examples which follow assume this project is named ‘myapp’.

Todo

Write docs for starting a new Rattail project / link here.

File / Module Structure

With the host and local systems identified, you can now start writing code…but where to put it? Assuming you already have a Rattail-based project with package named ‘myapp’ and assuming you were adding a POS->Rattail importer, the suggestion would be to add the following files to your project:

myapp/
   __init__.py
   importing/
      __init__.py
      model.py
      mypos.py

This is just a suggestion really, although it is the author’s personal convention which has served him well. Another typical scenario might be where you wish to “export” data from Rattail->POS, in which case you might do something like this instead:

myapp/
   __init__.py
   mypos/
      __init__.py
      importing/
         __init__.py
         model.py
         rattail.py

The difference may be subtle, but the intended effect is for the model.py file to contain logic which targets the “local” side of the importer, while the “other” file (e.g. mypos.py in the first example, rattail.py in the second) would contain logic for the “host” side of the importer. This “other” file is also where the import handler would live, since ultimately both sides must be known for an importer to function.

The main advantage to this layout / structure is that a given model.py might be shared among various importers. For example rattail.importing.model defines all the natively-supported importer logic when targeting various Rattail data models on the local side. (So technically if you didn’t need to override any of that, you wouldn’t need to provide your own model.py in the POS->Rattail scenario.)

Note that in practice the __init__.py file for an importing package typically has (only) the following contents, for convenience:

from . import model

Define Import Handler

For the sake of a single example we’ll continue to assume a POS->Rattail import is desired. Given the above file structure, that means the file myapp/importing/mypos.py will contain the handler. Within that file you’ll need to add something like the following:

from rattail import importing
from rattail.gpc import GPC

from myapp.mypos.db import Session as MyPosSession, model as mypos


class FromPosToRattail(importing.FromSQLAlchemyHandler, importing.ToRattailHandler):
    """
    Handler for MyPOS -> Rattail import.
    """
    host_title = "MyPOS"
    local_title = "Rattail"

    def make_host_session(self):
        return MyPosSession()

    def get_importers(self):
        return {
            'Department':    DepartmentImporter,
            'Vendor':        VendorImporter,
            'Product':       ProductImporter,
        }

Note that the importers (dept/vend/prod) don’t exist yet; those will be defined next, within this same file. Also here you can again see the strong “host -> local” patterns within the handler.

Choosing the correct base class(es) will be important. Here, by inheriting from ToRattailHandler we don’t have to declare connection info for the “local” (target) system because that is provided by the parent. Similarly for the host/source side, the FromSQLAlchemyHandler provides the bulk of logic and all we really have to do is provide a session opened on our POS database. Depending on your needs you may or may not find existing base classes to make things easier on you, vs. having to code all that logic yourself (which is still rather minimal). Also in some cases you may only wind up needing one base class for your handler, instead of two (which is more typical).

Define Importers

Okay, now for the fun part... right? Keeping with our example we’ll add 3 simple importers, for department, vendor and product data coming from the POS into Rattail. Since we’ll be targeting Rattail on the local side, we once again can leverage existing code, so all we really have to do is describe the host data. So, within the same file to which you added the handler, do something like:

class FromPOS(importing.FromSQLAlchemy):
    """
    Base class for importers with MyPOS as host.
    """

class DepartmentImporter(FromPOS, importing.model.DepartmentImporter):
    """
    Import department data from MyPOS -> Rattail.
    """
    host_model_class = mypos.Department
    key = 'number'
    supported_fields = [
        'number',
        'name',
    ]

    def normalize_host_object(self, mypos_dept):
        return {
            'number': mypos_dept.id,
            'name': mypos_dept.name.strip(),
        }


class VendorImporter(FromPOS, importing.model.VendorImporter):
    """
    Import vendor data from MyPOS -> Rattail.
    """
    host_model_class = mypos.Vendor
    key = 'id'
    supported_fields = [
        'id',
        'name',
    ]

    def normalize_host_object(self, mypos_vend):
        return {
            'id': mypos_vend.code.strip(),
            'name': mypos_vend.name.strip(),
        }


class ProductImporter(FromPOS, importing.model.ProductImporter):
    """
    Import product data from MyPOS -> Rattail.
    """
    host_model_class = mypos.Product
    key = 'upc'
    supported_fields = [
        'upc',
        'description',
        'size',
    ]

    def normalize_host_object(self, mypos_prod):
        return {
            'upc': GPC(mypos_prod.barcode),
            'description': mypos_prod.name.strip(),
            'size': mypos_prod.unit_size.strip(),
        }

Todo

need to explain the above a bit more
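Until that explanation is written, the following sketch may help show what the three pieces of each importer (key, supported_fields and normalize_host_object) are for. It is an assumption-laden illustration in plain Python rather than the framework's actual code: a SimpleNamespace stands in for a MyPOS department row, and data_differs is a hypothetical helper.

```python
from types import SimpleNamespace

KEY = 'number'                          # field which identifies a record
SUPPORTED_FIELDS = ['number', 'name']   # fields taking part in the comparison


def normalize_host_object(mypos_dept):
    """Convert a raw host object to a plain dict using local field names."""
    return {
        'number': mypos_dept.id,
        'name': mypos_dept.name.strip(),
    }


def data_differs(host_data, local_data, fields=SUPPORTED_FIELDS):
    """Return True if any supported field differs between the two dicts."""
    return any(host_data[f] != local_data[f] for f in fields)


# Stand-in for a row read from the POS database (note the stray whitespace).
mypos_dept = SimpleNamespace(id=1, name='  Grocery ')

host_data = normalize_host_object(mypos_dept)
local_data = {'number': 1, 'name': 'Grocery'}

print(host_data[KEY] == local_data[KEY])    # same record on both sides?
print(data_differs(host_data, local_data))  # does it need an update?
```

Normalizing first (stripping whitespace, renaming id to number) is what keeps cosmetic differences in the host data from being reported as changes.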

Configure Command Line

You almost certainly will want to configure the command line for your new importer(s), as it requires very little effort while providing a fairly robust feature set. Of course, command line isn’t the only way an importer might be invoked, but it is by far the lowest-hanging fruit.

Following typical conventions for Rattail projects, we’ll assume you have a file at myapp/commands.py which already contains various (sub)commands. It really doesn’t matter where you place the command code, because the next step will be to register it at its current location. Generally you would just need to add something like the following:

from rattail import commands

class ImportMyPOS(commands.ImportSubcommand):
    """
    Import data from MyPOS to Rattail
    """
    name = 'import-mypos'
    description = __doc__.strip()
    handler_spec = 'myapp.importing.mypos:FromPosToRattail'

Then to register it, edit your project’s setup.py file and add something like the following:

setup(
    name = 'myapp',
    entry_points = {
        'rattail.commands': [
            'import-mypos = myapp.commands:ImportMyPOS',
        ]
    }
)

Then you must “install” your project in whatever way you typically might, e.g. probably with pip install -e in development, so that the entry point will be properly recorded. Once this has happened you will have added a new “subcommand” which may be invoked as rattail import-mypos. Again assuming typical conventions, you might then do this:

cd /srv/envs/myapp
sudo -u rattail bin/rattail -c app/myapp.conf import-mypos -h
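Both the handler_spec and the entry point value above use a “module:object” style of spec string. As a rough idea of how such a string can be resolved to a class, here is a minimal sketch using only the standard library. The load_object function shown here is written for this example and is an assumption; Rattail's own loading code may differ.

```python
import importlib


def load_object(spec):
    """Resolve a 'package.module:ObjectName' spec to the object it names."""
    module_path, _, object_name = spec.partition(':')
    module = importlib.import_module(module_path)
    return getattr(module, object_name)


# A standard-library stand-in is used so the sketch runs without any project
# installed; in the example project the spec would instead be
# 'myapp.importing.mypos:FromPosToRattail'.
OrderedDict = load_object('collections:OrderedDict')
print(OrderedDict.__name__)
```

This also explains why the project must be installed first: the module named in the spec has to be importable before the object can be looked up.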