wuttasync.importing.base

Data Importer base class

class wuttasync.importing.base.FromFile(config, **kwargs)[source]

Base class for importer/exporter using input file as data source.

Depending on the subclass, it may be able to “guess” (at least partially) the path to the input file. If not, and/or to avoid ambiguity, the caller must specify the file path.

In most cases the caller may specify any of these attributes via kwarg to the class constructor, or e.g. to process_data(): input_file_path, input_file_dir, input_file_name.

The subclass itself can also specify these via override of: get_input_file_path(), get_input_file_dir(), get_input_file_name().

And of course the subclass must also override: open_input_file() and close_input_file().

input_file_path

Path to the input file.

input_file_dir

Path to folder containing input file(s).

input_file_name

Name of the input file, sans folder path.

input_file

Handle to the open input file, if applicable. May be set by open_input_file() for later reference within close_input_file().

close_input_file()[source]

Close the input file for source data.

Subclass must override to specify how this happens; default logic blindly calls the close() method on whatever input_file happens to point to.

See also open_input_file().

get_input_file_dir()[source]

This must return the folder with input file(s). It tries to guess it based on various attributes, namely input_file_dir.

Returns:

Path to folder with input file(s).

get_input_file_name()[source]

This must return the input filename, sans folder path. It tries to guess it based on various attributes, namely input_file_name.

Returns:

Input filename, sans folder path.

get_input_file_path()[source]

This must return the full path to the input file. It tries to guess it based on various attributes, namely input_file_path; failing that it may combine the results of get_input_file_dir() and get_input_file_name().

Returns:

Path to input file.

open_input_file()[source]

Open the input file for reading source data.

Subclass must override to specify how this happens; default logic is not implemented. Remember to set input_file if applicable for reference when closing.

See also get_input_file_path() and close_input_file().
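To illustrate the open/close contract described above, here is a minimal standalone sketch (not part of wuttasync) of a CSV-flavored subclass; the class name and CSV handling are invented for illustration:

```python
import csv

class CsvFromFileSketch:
    """Sketch of FromFile's open_input_file()/close_input_file() contract."""

    def __init__(self, input_file_path):
        self.input_file_path = input_file_path
        self.input_file = None

    def open_input_file(self):
        # remember the handle via input_file, so close_input_file() can use it
        self.input_file = open(self.input_file_path, 'rt', encoding='utf-8')
        self.reader = csv.DictReader(self.input_file)

    def close_input_file(self):
        # mirrors the default logic: just call close() on input_file
        self.input_file.close()
        self.input_file = None
```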

setup()[source]

Open the input file. See also open_input_file().

teardown()[source]

Close the input file. See also close_input_file().

exception wuttasync.importing.base.ImportLimitReached[source]

Exception raised when an import/export job reaches the max number of changes allowed.

class wuttasync.importing.base.Importer(config, **kwargs)[source]

Base class for all data importers / exporters.

As with ImportHandler, despite the name Importer, this class can be used for export as well. Occasionally it’s helpful to know “which mode” is in effect, mostly for display to the user. See also orientation and actioning.

The role of the “importer/exporter” (instance of this class) is to process the import/export of data for one “model” - which generally speaking, means one table. Whereas the “import/export handler” (ImportHandler instance) orchestrates the overall DB connections, transactions and invokes the importer(s)/exporter(s). So multiple importers/exporters may run in the context of a single handler job.

handler

Reference to the parent ImportHandler instance.

model_class

Reference to the data model class representing the target side, if applicable.

This normally would be a SQLAlchemy mapped class, e.g. Person for importing to the Wutta People table.

It is primarily (only?) used when the target side of the import/export uses SQLAlchemy ORM.

fields

This is the official list of “effective” fields to be processed for the current import/export job.

Code theoretically should not access this directly but instead call get_fields(). However it is often convenient to overwrite this attribute directly, for dynamic fields. If so then get_fields() will return the new value. And really, it’s probably just as safe to read this attribute directly too.

excluded_fields

This attribute will often not exist, but is mentioned here for reference.

It may be specified via constructor param, in which case each field listed therein will be removed from fields.

property actioning

Convenience property which returns the value of wuttasync.importing.handlers.ImportHandler.actioning from the parent import/export handler.

allow_create = True

Flag indicating whether this importer/exporter should ever allow records to be created on the target side.

This flag is typically defined in code for each handler.

See also create.

allow_delete = True

Flag indicating whether this importer/exporter should ever allow records to be deleted on the target side.

This flag is typically defined in code for each handler.

See also delete.

allow_update = True

Flag indicating whether this importer/exporter should ever allow records to be updated on the target side.

This flag is typically defined in code for each handler.

See also update.

cached_target = None

This is None unless caches_target is true, in which case it may (at times) hold the result from get_target_cache().

caches_target = False

Flag indicating the importer/exporter should pre-fetch the existing target data. This is usually what we want, so both source and target data sets are held in memory and lookups may be done between them without additional fetching.

When this flag is false, the importer/exporter must query the target for every record it gets from the source data, when looking for a match.

can_delete_object(obj, data=None)[source]

Should return true or false indicating whether the given object “can” be deleted. Default is to return true in all cases.

If you return false then the importer will know not to call delete_target_object() even if the data sets imply that it should.

Parameters:
  • obj – Raw object on the target side.

  • data – Normalized data dict for the target record, if known.

Returns:

True if object can be deleted, else False.

create = None

Flag indicating the current import/export job should create records on the target side, when applicable.

This flag is typically set by the caller, e.g. via command line args.

See also allow_create.

create_target_object(key, source_data)[source]

Create and return a new target object for the given key, fully populated from the given source data. This may return None if no object is created.

This method will typically call make_empty_object() and then update_target_object().

Returns:

New object for the target side, or None.

data_diffs(source_data, target_data, fields=None)[source]

Find all (relevant) fields with differing values between the two data records, source and target.

This is a simple wrapper around wuttasync.util.data_diffs(), but unless the caller specifies a fields list, the following default is used: it calls get_fields() to get the effective field list, and from that removes the fields indicated by get_keys().

The thinking here is that the goal of this function is to find true diffs; any “key” fields will already match (or not) based on the overall processing logic, and needn’t be checked further.
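The default comparison field list can be sketched as follows (the function name is hypothetical, for illustration only):

```python
def default_diff_fields(effective_fields, key_fields):
    # data_diffs() default: compare the effective fields, minus the key
    # fields, since keys already match by the time records are compared
    return [f for f in effective_fields if f not in key_fields]
```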

delete = None

Flag indicating the current import/export job should delete records on the target side, when applicable.

This flag is typically set by the caller, e.g. via command line args.

See also allow_delete.

delete_target_object(obj)[source]

Delete the given raw object from the target side, and return true if successful.

This is called from do_delete().

Default logic for this method just returns false; subclass should override if needed.

Returns:

Should return True if deletion succeeds, or False if deletion failed or was skipped.

do_create_update(all_source_data, progress=None)[source]

Import/export the given normalized source data; create and/or update target records as needed.

Parameters:
  • all_source_data – Sequence of all normalized source data, e.g. as obtained from normalize_source_data().

  • progress – Optional progress indicator factory.

Returns:

A 2-tuple of (created, updated) as follows:

  • created - list of records created on the target

  • updated - list of records updated on the target

This loops through all source data records, and for each will try to find a matching target record. If a match is found it also checks if any field values differ between them. So, calls to create_target_object() and update_target_object() may also happen from here.

do_delete(source_keys, changes=None, progress=None)[source]

Delete records from the target side as needed, per the given source data.

This will call get_deletable_keys() to discover which keys existing on the target side could theoretically be deleted.

From that set it will remove all the given source keys - since such keys still exist on the source, they should not be deleted from target.

If any “deletable” keys remain, their corresponding objects are removed from target via delete_target_object().

Parameters:
  • source_keys – A set of keys for all source records. Essentially this is just the list of keys for which target records should not be deleted - since they still exist in the data source.

  • changes – Number of changes which have already been made on the target side. Used to enforce max allowed changes, if applicable.

  • progress – Optional progress indicator factory.

Returns:

List of target records which were deleted.
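The key selection described above is plain set arithmetic; a minimal sketch (function name hypothetical):

```python
def keys_to_delete(deletable_keys, source_keys):
    # anything deletable which no longer exists on the source gets removed;
    # keys still present on the source must not be deleted from the target
    return deletable_keys - source_keys
```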

property dry_run

Convenience property which returns the value of wuttasync.importing.handlers.ImportHandler.dry_run from the parent import/export handler.

get_deletable_keys(progress=None)[source]

Return a set of record keys from the target side, which are potentially eligible for deletion.

Inclusion in this set does not imply a given record/key should be deleted, only that app logic (e.g. business rules) does not prevent it.

Default logic here will look in cached_target and then call can_delete_object() for each record in the cache. If that call returns true for a given key, it is included in the result.

Returns:

The set of target record keys eligible for deletion.

get_fields()[source]

This should return the “effective” list of fields which are to be used for the import/export.

See also fields which is normally what this returns.

All fields in this list should also be found in the output for get_supported_fields().

See also get_keys() and get_simple_fields().

Returns:

List of “effective” field names.

get_keys()[source]

Must return the key field(s) for use with import/export.

All fields in this list should also be found in the output for get_fields().

Returns:

List of “key” field names.

get_model_title()[source]

Returns the display title for the target data model.

get_record_key(data)[source]

Returns the canonical key value for the given normalized data record.

Parameters:

data – Normalized data record (dict).

Returns:

A tuple of field values, corresponding to the import/export key fields.

Note that this calls get_keys() to determine the import/export key fields.

So if an importer has key = 'id' then get_keys() would return ('id',) and this method would return just the id value e.g. (42,) for the given data record.

The return value is always a tuple for consistency and to allow for composite key fields.
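A minimal sketch of the behavior described above, assuming the key fields are passed in explicitly (in the real class they come from get_keys()):

```python
def record_key(data, keys):
    # keys is e.g. ('id',); the result is always a tuple, even for a
    # single key field, to allow for composite keys
    return tuple(data[field] for field in keys)
```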

get_simple_fields()[source]

This should return a (possibly empty) list of “simple” fields for the import/export. A “simple” field is one where the value is a simple scalar, so e.g. can use getattr(obj, field) to read and setattr(obj, field, value) to write.

See also get_supported_fields() and get_fields().

Returns:

Possibly empty list of “simple” field names.

get_source_objects()[source]

This method (if applicable) should return a sequence of “raw” data objects (i.e. non-normalized records) from the source.

This method is typically called from normalize_source_data() which then also handles the normalization.

get_supported_fields()[source]

This should return the full list of fields which are available for the import/export.

Note that this field list applies first and foremost to the target side, i.e. if the target (table etc.) has no “foo” field defined then it should not be listed here.

But it also applies to the source side, e.g. if target does define a “foo” field but source does not, then it again should not be listed here.

See also get_simple_fields() and get_fields().

Returns:

List of all “supported” field names.

get_target_cache(source_data=None, progress=None)[source]

Fetch all (existing) raw objects and normalized data from the target side, and return a cache object with all of that.

This method will call get_target_objects() first, and pass along the source_data param if specified. From there it will call normalize_target_object() and get_record_key() for each.

Parameters:
  • source_data – Sequence of normalized source data for the import/export job, if known.

  • progress – Optional progress indicator factory.

Returns:

Dict whose keys are record keys (one entry per normalized target record) and whose values are nested dicts containing the raw object and normalized record.

A minimal but complete example of what this return value looks like:

{
    (1,): {
        'object': <some_object_1>,
        'data': {'id': 1, 'description': 'foo'},
    },
    (2,): {
        'object': <some_object_2>,
        'data': {'id': 2, 'description': 'bar'},
    },
}

get_target_object(key)[source]

Should return the object from (existing) target data set which corresponds to the given record key, if found.

Note that the default logic is able to find/return the object from cached_target if applicable. But it is not able to do a one-off lookup e.g. in the target DB. If you need the latter then you should override this method.

Returns:

Raw target data object, or None.

get_target_objects(source_data=None, progress=None)[source]

Fetch all existing raw objects from the data target. Or at least, enough of them to satisfy matching on the given source data (if applicable).

Parameters:
  • source_data – Sequence of normalized source data for the import/export job, if known.

  • progress – Optional progress indicator factory.

Returns:

Corresponding sequence of raw objects from the target side.

Note that the source data is provided only for cases where that might be useful; it often is not.

But for instance if the source data contains say an ID field and the min/max values present in the data set are 1 thru 100, but the target side has millions of records, you might only fetch ID <= 100 from target side as an optimization.

get_unique_data(source_data)[source]

Return a copy of the given source data, with any duplicate records removed.

This looks for duplicates based on the effective key fields, cf. get_keys(). The first record found with a given key is kept; subsequent records with that key are discarded.

This is called from process_data() and is done largely for sanity’s sake, to avoid indeterminate behavior when source data contains duplicates. For instance:

Problem #1: if the source contains 2 records with key ‘X’ it makes no sense to create both records on the target side.

Problem #2: if the 2 source records have different data (apart from their key) then which should the target reflect?

So the main point of this method is to discard duplicates, to avoid problem #1 - but to do so in a deterministic way, so that the “choice” of which record is kept will not vary across runs; this at least “pseudo-resolves” problem #2.

Parameters:

source_data – Sequence of normalized source data.

Returns:

A 2-tuple of (source_data, unique_keys) where:

  • source_data is the final list of source data

  • unique_keys is a set of the source record keys
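The first-record-wins deduplication described above can be sketched as a standalone function (names hypothetical), relying on dict insertion order for determinism:

```python
def unique_data(source_data, keys):
    # keep the first record seen for each key; later duplicates are
    # discarded, deterministically, since dicts preserve insertion order
    unique = {}
    for record in source_data:
        key = tuple(record[field] for field in keys)
        unique.setdefault(key, record)
    return list(unique.values()), set(unique)
```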

make_empty_object(key)[source]

Return a new empty target object for the given key.

This method is called from create_target_object(). It should only populate the object’s key, and leave the rest of the fields to update_target_object().

Default logic will call make_object() to get the bare instance, then populate just the fields from get_keys().

make_object()[source]

Make a bare target object instance.

This method need not populate the object in any way. See also make_empty_object().

Default logic will make a new instance of model_class.

normalize_source_data(source_objects=None, progress=None)[source]

This method must return the full list of normalized data records from the source.

Default logic here will call get_source_objects() and then for each object normalize_source_object_all() is called.

Parameters:
  • source_objects – Optional sequence of raw objects from the data source. If not specified, it is obtained from get_source_objects().

  • progress – Optional progress indicator factory.

Returns:

List of normalized source data records.

normalize_source_object(obj)[source]

This should return a single “normalized” data record for the given source object.

Subclass will usually need to override this, to “convert” source data into the shared format required for import/export. The default logic merely returns the object as-is!

Note that if this method returns None then the object is effectively skipped, treated like it does not exist on the source side.

Parameters:

obj – Raw object from data source.

Returns:

Dict of normalized data fields, or None.
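A hedged sketch of such an override, standalone for illustration; the field names and the “discontinued” skip rule are invented, not part of wuttasync:

```python
class ProductNormalizerSketch:
    """Sketch of a normalize_source_object() override."""

    def normalize_source_object(self, obj):
        if obj.discontinued:
            return None  # skipped: treated as absent from the source
        return {'id': obj.id, 'description': obj.description}
```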

normalize_source_object_all(obj)[source]

This method should “iterate” over the given object and return a list of corresponding normalized data records.

In most cases, the object is “singular” and it doesn’t really make sense to return more than one data record for it. But this method is here for subclass to override in those rare cases where you do need to “expand” the object into multiple source data records.

Default logic for this method simply calls normalize_source_object() for the given object, and returns a list with just that one record.

Parameters:

obj – Raw object from data source.

Returns:

List of normalized data records corresponding to the source object.

normalize_target_object(obj)[source]

This should return a “normalized” data record for the given raw object from the target side.

Subclass will often need to override this, to “convert” target object into the shared format required for import/export. The default logic is only able to handle “simple” fields; cf. get_simple_fields().

It’s possible to optimize this somewhat by checking get_fields(), and skipping normalization for any fields which aren’t “effective” for the current job.

Note that if this method returns None then the object is ignored, treated like it does not exist on the target side.

Parameters:

obj – Raw object from data target.

Returns:

Dict of normalized data fields, or None.
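The “simple fields only” default logic amounts to reading each field via getattr(); a minimal sketch (function name hypothetical):

```python
def normalize_simple_target(obj, simple_fields):
    # sketch of the default logic: each "simple" field is a plain scalar
    # attribute, so getattr() suffices to read it
    return {field: getattr(obj, field) for field in simple_fields}
```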

property orientation

Convenience property which returns the value of wuttasync.importing.handlers.ImportHandler.orientation from the parent import/export handler.

process_data(source_data=None, progress=None)[source]

Perform the data import/export operations on the target.

This is the core feature logic and may create, update and/or delete records on the target side, depending on (subclass) implementation. It is invoked directly by the parent handler.

Note that subclass generally should not override this method, but instead some of the others.

This first calls setup() to prepare things as needed.

If no source data is specified, it calls normalize_source_data() to get that. Regardless, it also calls get_unique_data() to discard any duplicates.

If caches_target is set, it calls get_target_cache() and assigns result to cached_target.

Then depending on values for create, update and delete, it may call do_create_update() and/or do_delete().

And finally it calls teardown() for cleanup.

Parameters:
  • source_data – Sequence of normalized source data, if known.

  • progress – Optional progress indicator factory.

Returns:

A 3-tuple of (created, updated, deleted) as follows:

  • created - list of records created on the target

  • updated - list of records updated on the target

  • deleted - list of records deleted on the target

setup()[source]

Perform any setup needed before starting the import/export job.

This is called from within process_data(). Default logic does nothing.

teardown()[source]

Perform any teardown needed after ending the import/export job.

This is called from within process_data(). Default logic does nothing.

update = None

Flag indicating the current import/export job should update records on the target side, when applicable.

This flag is typically set by the caller, e.g. via command line args.

See also allow_update.

update_target_object(obj, source_data, target_data=None)[source]

Update the target object with the given source data, and return the updated object.

This method may be called from do_create_update() for a normal update, or create_target_object() when creating a new record.

It should update the object for any of get_fields() which appear to differ. However it need not bother for the get_keys() fields, since those will already be accurate.

Parameters:
  • obj – Raw target object.

  • source_data – Dict of normalized data for source record.

  • target_data – Dict of normalized data for existing target record, if a typical update. Will be missing for a new object.

Returns:

The final updated object. In most/all cases this will be the same instance as the original obj provided by the caller.
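The update rule described above - touch non-key fields only, and only where values differ - can be sketched for “simple” attribute-style fields (names hypothetical):

```python
def update_fields(obj, source_data, fields, keys):
    # skip key fields (already accurate); write only changed values
    for field in fields:
        if field in keys:
            continue
        if getattr(obj, field) != source_data[field]:
            setattr(obj, field, source_data[field])
    return obj
```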

class wuttasync.importing.base.ToSqlalchemy(config, **kwargs)[source]

Base class for importer/exporter which uses SQLAlchemy ORM on the target side.

caches_target = True

get_target_object(key)[source]

Tries to fetch the object from target DB using ORM query.

get_target_objects(source_data=None, progress=None)[source]

Fetches target objects via the ORM query from get_target_query().

get_target_query(source_data=None)[source]

Returns an ORM query suitable to fetch existing objects from the target side. This is called from get_target_objects().
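A hedged sketch of this target-side pattern, standalone rather than subclassing ToSqlalchemy; the Widget model, session handling, and plain query-everything default are assumptions for illustration:

```python
import sqlalchemy as sa
from sqlalchemy import orm

Base = orm.declarative_base()

class Widget(Base):
    __tablename__ = 'widget'
    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.String)

class WidgetTargetSketch:
    """Sketch of the get_target_query()/get_target_objects() pattern."""

    model_class = Widget

    def __init__(self, session):
        self.session = session

    def get_target_query(self, source_data=None):
        # plausible default: query every row of the mapped model class
        return self.session.query(self.model_class)

    def get_target_objects(self, source_data=None, progress=None):
        return self.get_target_query(source_data=source_data).all()
```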