wuttasync.importing.base
Data Importer base class
- class wuttasync.importing.base.FromFile(config, **kwargs)[source]¶
Base class for importer/exporter using input file as data source.
Depending on the subclass, it may be able to “guess” (at least partially) the path to the input file. If not, and/or to avoid ambiguity, the caller must specify the file path.
In most cases the caller may specify any of these via kwarg to the class constructor, or e.g. process_data():
- input_file_path
- input_file_dir
- input_file_name
The subclass itself can also specify these via override of these methods:
- get_input_file_path()
- get_input_file_dir()
- get_input_file_name()
And of course the subclass must override these too:
- open_input_file()
- close_input_file()
(and see also input_file)
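For example, a minimal sketch (MyCsvImporter is a hypothetical FromFile subclass; paths are illustrative) of specifying the input file via constructor kwargs:

    # give the full path directly...
    importer = MyCsvImporter(config, input_file_path='/tmp/people.csv')

    # ...or give folder and filename separately
    importer = MyCsvImporter(config, input_file_dir='/tmp',
                             input_file_name='people.csv')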
- input_file_path¶
Path to the input file.
- input_file_dir¶
Path to folder containing input file(s).
- input_file_name¶
Name of the input file, sans folder path.
- input_file¶
Handle to the open input file, if applicable. May be set by open_input_file() for later reference within close_input_file().
- close_input_file()[source]¶
Close the input file for source data.
Subclass must override to specify how this happens; default logic blindly calls the close() method on whatever input_file happens to point to.
See also open_input_file().
- get_input_file_dir()[source]¶
This must return the folder with input file(s). It tries to guess the value based on various attributes, e.g. input_file_dir.
- Returns:
Path to folder with input file(s).
- get_input_file_name()[source]¶
This must return the input filename, sans folder path. It tries to guess the value based on various attributes, e.g. input_file_name.
- Returns:
Input filename, sans folder path.
- get_input_file_path()[source]¶
This must return the full path to the input file. It tries to guess the value based on various attributes, e.g. input_file_path.
- Returns:
Path to input file.
- open_input_file()[source]¶
Open the input file for reading source data.
Subclass must override to specify how this happens; default logic is not implemented. Remember to set input_file if applicable, for reference when closing.
See also get_input_file_path() and close_input_file().
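A minimal override sketch, assuming a CSV-based subclass using Python’s built-in csv module (the class name and the input_reader attribute are illustrative only; input_file is the attribute documented above):

    import csv

    from wuttasync.importing.base import FromFile


    class MyCsvImporter(FromFile):
        """Hypothetical importer reading source data from a CSV file."""

        def open_input_file(self):
            path = self.get_input_file_path()
            self.input_file = open(path, 'rt', encoding='utf_8')
            # keep a reader handle for later use (illustrative attribute)
            self.input_reader = csv.DictReader(self.input_file)

        def close_input_file(self):
            self.input_file.close()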
- setup()[source]¶
Open the input file. See also open_input_file().
- teardown()[source]¶
Close the input file. See also close_input_file().
- exception wuttasync.importing.base.ImportLimitReached[source]¶
Exception raised when an import/export job reaches the max number of changes allowed.
- class wuttasync.importing.base.Importer(config, **kwargs)[source]¶
Base class for all data importers / exporters.
So as with ImportHandler, despite the name Importer this class can be used for export as well. Occasionally it’s helpful to know “which mode” is in effect, mostly for display to the user. See also orientation and actioning.
The role of the “importer/exporter” (instance of this class) is to process the import/export of data for one “model” - which generally speaking, means one table. Whereas the “import/export handler” (ImportHandler instance) orchestrates the overall DB connections and transactions, and invokes the importer(s)/exporter(s). So multiple importers/exporters may run in the context of a single handler job.
- handler¶
Reference to the parent ImportHandler instance.
- model_class¶
Reference to the data model class representing the target side, if applicable.
This normally would be a SQLAlchemy mapped class, e.g. Person for importing to the Wutta People table.
It is primarily (only?) used when the target side of the import/export uses SQLAlchemy ORM.
- fields¶
This is the official list of “effective” fields to be processed for the current import/export job.
Code theoretically should not access this directly but instead call get_fields(). However it is often convenient to overwrite this attribute directly, for dynamic fields. If so then get_fields() will return the new value. And really, it’s probably just as safe to read this attribute directly too.
- excluded_fields¶
This attribute will often not exist, but is mentioned here for reference.
It may be specified via constructor param, in which case each field listed therein will be removed from fields.
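For example, a sketch of excluding a field at construction time (MyPersonImporter is a hypothetical subclass):

    importer = MyPersonImporter(config, excluded_fields=['middle_name'])
    assert 'middle_name' not in importer.get_fields()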
- property actioning¶
Convenience property which returns the value of wuttasync.importing.handlers.ImportHandler.actioning from the parent import/export handler.
- allow_create = True¶
Flag indicating whether this importer/exporter should ever allow records to be created on the target side.
This flag is typically defined in code for each handler.
See also create.
- allow_delete = True¶
Flag indicating whether this importer/exporter should ever allow records to be deleted on the target side.
This flag is typically defined in code for each handler.
See also delete.
- allow_update = True¶
Flag indicating whether this importer/exporter should ever allow records to be updated on the target side.
This flag is typically defined in code for each handler.
See also update.
- cached_target = False¶
This is None unless caches_target is true, in which case it may (at times) hold the result from get_target_cache().
- caches_target = False¶
Flag indicating the importer/exporter should pre-fetch the existing target data. This is usually what we want, so both source and target data sets are held in memory and lookups may be done between them without additional fetching.
When this flag is false, the importer/exporter must query the target for every record it gets from the source data, when looking for a match.
- can_delete_object(obj, data=None)[source]¶
Should return true or false indicating whether the given object “can” be deleted. Default is to return true in all cases.
If you return false then the importer will know not to call delete_target_object() even if the data sets imply that it should.
- Parameters:
obj – Raw object on the target side.
data – Normalized data dict for the target record, if known.
- Returns:
True if object can be deleted, else False.
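A sketch of a business-rule override, assuming a hypothetical target object with an active flag:

    def can_delete_object(self, obj, data=None):
        # never delete records still marked active (illustrative rule)
        return not obj.active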
- create = None¶
Flag indicating the current import/export job should create records on the target side, when applicable.
This flag is typically set by the caller, e.g. via command line args.
See also allow_create.
- create_target_object(key, source_data)[source]¶
Create and return a new target object for the given key, fully populated from the given source data. This may return None if no object is created.
This method will typically call make_empty_object() and then update_target_object().
- Returns:
New object for the target side, or None.
- data_diffs(source_data, target_data, fields=None)[source]¶
Find all (relevant) fields with differing values between the two data records, source and target.
This is a simple wrapper around wuttasync.util.data_diffs(), but unless caller specifies a fields list, this will use the following by default: it calls get_fields() to get the effective field list, and from that it removes the fields indicated by get_keys().
The thinking here is that the goal of this function is to find true diffs, but any “key” fields will already match (or not) based on the overall processing logic and needn’t be checked further.
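For instance, assuming the effective fields are ['id', 'name', 'email'] with key field 'id' (values purely illustrative):

    source_data = {'id': 42, 'name': 'Alice', 'email': 'alice@example.com'}
    target_data = {'id': 42, 'name': 'Alice', 'email': 'alice@example.org'}

    # key field 'id' is skipped; only 'name' and 'email' are compared
    importer.data_diffs(source_data, target_data)   # -> ['email']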
- delete = None¶
Flag indicating the current import/export job should delete records on the target side, when applicable.
This flag is typically set by the caller, e.g. via command line args.
See also allow_delete.
- delete_target_object(obj)[source]¶
Delete the given raw object from the target side, and return true if successful.
This is called from do_delete().
Default logic for this method just returns false; subclass should override if needed.
- Returns:
Should return True if deletion succeeds, or False if deletion failed or was skipped.
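A sketch of an override for a SQLAlchemy target, assuming the importer has a session attribute pointing to the target DB session (an assumption; not documented here):

    def delete_target_object(self, obj):
        # assumes self.session is the target DB session
        self.session.delete(obj)
        return True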
- do_create_update(all_source_data, progress=None)[source]¶
Import/export the given normalized source data; create and/or update target records as needed.
- Parameters:
all_source_data – Sequence of all normalized source data, e.g. as obtained from normalize_source_data().
progress – Optional progress indicator factory.
- Returns:
A 2-tuple of (created, updated) as follows:
created - list of records created on the target
updated - list of records updated on the target
This loops through all source data records, and for each will try to find a matching target record. If a match is found it also checks if any field values differ between them. So calls to other methods may also happen from here, e.g. get_target_object(), data_diffs(), create_target_object() and update_target_object().
- do_delete(source_keys, changes=None, progress=None)[source]¶
Delete records from the target side as needed, per the given source data.
This will call get_deletable_keys() to discover which keys existing on the target side could theoretically allow being deleted.
From that set it will remove all the given source keys - since such keys still exist on the source, they should not be deleted from target.
If any “deletable” keys remain, their corresponding objects are removed from target via delete_target_object().
- Parameters:
source_keys – A set of keys for all source records. Essentially this is just the list of keys for which target records should not be deleted - since they still exist in the data source.
changes – Number of changes which have already been made on the target side. Used to enforce max allowed changes, if applicable.
progress – Optional progress indicator factory.
- Returns:
List of target records which were deleted.
- property dry_run¶
Convenience property which returns the value of wuttasync.importing.handlers.ImportHandler.dry_run from the parent import/export handler.
- get_deletable_keys(progress=None)[source]¶
Return a set of record keys from the target side, which are potentially eligible for deletion.
Inclusion in this set does not imply a given record/key should be deleted, only that app logic (e.g. business rules) does not prevent it.
Default logic here will look in the cached_target and then call can_delete_object() for each record in the cache. If that call returns true for a given key, it is included in the result.
- Returns:
The set of target record keys eligible for deletion.
- get_fields()[source]¶
This should return the “effective” list of fields which are to be used for the import/export.
See also fields, which is normally what this returns.
All fields in this list should also be found in the output for get_supported_fields().
See also get_keys() and get_simple_fields().
- Returns:
List of “effective” field names.
- get_keys()[source]¶
Must return the key field(s) for use with import/export.
All fields in this list should also be found in the output for get_fields().
- Returns:
List of “key” field names.
- get_record_key(data)[source]¶
Returns the canonical key value for the given normalized data record.
- Parameters:
data – Normalized data record (dict).
- Returns:
A tuple of field values, corresponding to the import/export key fields.
Note that this calls get_keys() to determine the import/export key fields.
So if an importer has key = 'id' then get_keys() would return ('id',) and this method would return just the id value, e.g. (42,) for the given data record.
The return value is always a tuple for consistency and to allow for composite key fields.
- get_simple_fields()[source]¶
This should return a (possibly empty) list of “simple” fields for the import/export. A “simple” field is one where the value is a simple scalar, so e.g. can use getattr(obj, field) to read and setattr(obj, field, value) to write.
See also get_supported_fields() and get_fields().
- Returns:
Possibly empty list of “simple” field names.
- get_source_objects()[source]¶
This method (if applicable) should return a sequence of “raw” data objects (i.e. non-normalized records) from the source.
This method is typically called from normalize_source_data() which then also handles the normalization.
- get_supported_fields()[source]¶
This should return the full list of fields which are available for the import/export.
Note that this field list applies first and foremost to the target side, i.e. if the target (table etc.) has no “foo” field defined then it should not be listed here.
But it also applies to the source side, e.g. if target does define a “foo” field but source does not, then it again should not be listed here.
See also get_simple_fields() and get_fields().
- Returns:
List of all “supported” field names.
- get_target_cache(source_data=None, progress=None)[source]¶
Fetch all (existing) raw objects and normalized data from the target side, and return a cache object with all of that.
This method will call get_target_objects() first, and pass along the source_data param if specified. From there it will call normalize_target_object() and get_record_key() for each.
- Parameters:
source_data – Sequence of normalized source data for the import/export job, if known.
progress – Optional progress indicator factory.
- Returns:
Dict whose keys are record keys (so one entry for every normalized target record) and the values are a nested dict with raw object and normalized record.
A minimal but complete example of what this return value looks like:
{
    (1,): {
        'object': <some_object_1>,
        'data': {'id': 1, 'description': 'foo'},
    },
    (2,): {
        'object': <some_object_2>,
        'data': {'id': 2, 'description': 'bar'},
    },
}
- get_target_object(key)[source]¶
Should return the object from (existing) target data set which corresponds to the given record key, if found.
Note that the default logic is able to find/return the object from cached_target if applicable. But it is not able to do a one-off lookup e.g. in the target DB. If you need the latter then you should override this method.
- Returns:
Raw target data object, or None.
- get_target_objects(source_data=None, progress=None)[source]¶
Fetch all existing raw objects from the data target. Or at least, enough of them to satisfy matching on the given source data (if applicable).
- Parameters:
source_data – Sequence of normalized source data for the import/export job, if known.
progress – Optional progress indicator factory.
- Returns:
Corresponding sequence of raw objects from the target side.
Note that the source data is provided only for cases where that might be useful; it often is not.
But for instance if the source data contains say an ID field and the min/max values present in the data set are 1 thru 100, but the target side has millions of records, you might only fetch ID <= 100 from target side as an optimization.
- get_unique_data(source_data)[source]¶
Return a copy of the given source data, with any duplicate records removed.
This looks for duplicates based on the effective key fields, cf. get_keys(). The first record found with a given key is kept; subsequent records with that key are discarded.
This is called from process_data() and is done largely for sanity’s sake, to avoid indeterminate behavior when source data contains duplicates. For instance:
Problem #1: If source contains 2 records with key ‘X’ it makes no sense to create both records on the target side.
Problem #2: if the 2 source records have different data (apart from their key) then which should target reflect?
So the main point of this method is to discard the duplicates to avoid problem #1, but do it in a deterministic way so at least the “choice” of which record is kept will not vary across runs; hence “pseudo-resolve” problem #2.
- Parameters:
source_data – Sequence of normalized source data.
- Returns:
A 2-tuple of (source_data, unique_keys) where:
source_data is the final list of source data
unique_keys is a set of the source record keys
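For instance, assuming key field 'id' (values purely illustrative):

    data = [
        {'id': 1, 'name': 'Vinegar'},
        {'id': 1, 'name': 'Apple Cider Vinegar'},   # duplicate key; discarded
        {'id': 2, 'name': 'Olive Oil'},
    ]
    final, unique_keys = importer.get_unique_data(data)
    # final == [{'id': 1, 'name': 'Vinegar'}, {'id': 2, 'name': 'Olive Oil'}]
    # unique_keys == {(1,), (2,)}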
- make_empty_object(key)[source]¶
Return a new empty target object for the given key.
This method is called from create_target_object(). It should only populate the object’s key, and leave the rest of the fields to update_target_object().
Default logic will call make_object() to get the bare instance, then populate just the fields from get_keys().
- make_object()[source]¶
Make a bare target object instance.
This method need not populate the object in any way. See also make_empty_object().
Default logic will make a new instance of model_class.
- normalize_source_data(source_objects=None, progress=None)[source]¶
This method must return the full list of normalized data records from the source.
Default logic here will call get_source_objects() and then, for each object, normalize_source_object_all() is called.
- Parameters:
source_objects – Optional sequence of raw objects from the data source. If not specified, it is obtained from get_source_objects().
progress – Optional progress indicator factory.
- Returns:
List of normalized source data records.
- normalize_source_object(obj)[source]¶
This should return a single “normalized” data record for the given source object.
Subclass will usually need to override this, to “convert” source data into the shared format required for import/export. The default logic merely returns the object as-is!
Note that if this method returns None then the object is effectively skipped, treated like it does not exist on the source side.
- Parameters:
obj – Raw object from data source.
- Returns:
Dict of normalized data fields, or None.
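A sketch of a typical override, assuming the raw source objects are dict-like rows from a hypothetical CSV reader (column names illustrative):

    def normalize_source_object(self, obj):
        return {
            'id': int(obj['ID']),
            'name': obj['Name'].strip(),
            'email': obj['Email'].lower() or None,
        }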
- normalize_source_object_all(obj)[source]¶
This method should “iterate” over the given object and return a list of corresponding normalized data records.
In most cases, the object is “singular” and it doesn’t really make sense to return more than one data record for it. But this method is here for subclass to override in those rare cases where you do need to “expand” the object into multiple source data records.
Default logic for this method simply calls normalize_source_object() for the given object, and returns a list with just that one record.
obj – Raw object from data source.
- Returns:
List of normalized data records corresponding to the source object.
- normalize_target_object(obj)[source]¶
This should return a “normalized” data record for the given raw object from the target side.
Subclass will often need to override this, to “convert” the target object into the shared format required for import/export. The default logic is only able to handle “simple” fields; cf. get_simple_fields().
It’s possible to optimize this somewhat by checking get_fields(), since normalization may then be skipped for any fields which aren’t “effective” for the current job.
Note that if this method returns None then the object is ignored, treated like it does not exist on the target side.
- Parameters:
obj – Raw object from data target.
- Returns:
Dict of normalized data fields, or None.
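A sketch of an override which handles a non-simple field, assuming a hypothetical target object with a date attribute and a shared format using ISO date strings:

    def normalize_target_object(self, obj):
        # let default logic handle the "simple" fields first
        data = super().normalize_target_object(obj)
        # then convert the non-simple field (illustrative)
        data['start_date'] = obj.start_date.isoformat() if obj.start_date else None
        return data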
- property orientation¶
Convenience property which returns the value of wuttasync.importing.handlers.ImportHandler.orientation from the parent import/export handler.
- process_data(source_data=None, progress=None)[source]¶
Perform the data import/export operations on the target.
This is the core feature logic and may create, update and/or delete records on the target side, depending on (subclass) implementation. It is invoked directly by the parent handler.
Note that subclass generally should not override this method, but instead some of the others.
This first calls setup() to prepare things as needed.
If no source data is specified, it calls normalize_source_data() to get that. Regardless, it also calls get_unique_data() to discard any duplicates.
If caches_target is set, it calls get_target_cache() and assigns the result to cached_target.
Then depending on values for create, update and delete it may call do_create_update() and/or do_delete().
And finally it calls teardown() for cleanup.
- Parameters:
source_data – Sequence of normalized source data, if known.
progress – Optional progress indicator factory.
- Returns:
A 3-tuple of (created, updated, deleted) as follows:
created - list of records created on the target
updated - list of records updated on the target
deleted - list of records deleted on the target
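A usage sketch from the caller’s perspective (MyPersonImporter is hypothetical, and passing these particular kwargs to the constructor is an assumption; normally the parent handler takes care of this):

    importer = MyPersonImporter(config, handler=handler,
                                create=True, update=True, delete=False)
    created, updated, deleted = importer.process_data()
    print(f"created {len(created)}, updated {len(updated)}, deleted {len(deleted)}")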
- setup()[source]¶
Perform any setup needed before starting the import/export job.
This is called from within process_data(). Default logic does nothing.
- teardown()[source]¶
Perform any teardown needed after ending the import/export job.
This is called from within process_data(). Default logic does nothing.
- update = None¶
Flag indicating the current import/export job should update records on the target side, when applicable.
This flag is typically set by the caller, e.g. via command line args.
See also allow_update.
- update_target_object(obj, source_data, target_data=None)[source]¶
Update the target object with the given source data, and return the updated object.
This method may be called from do_create_update() for a normal update, or create_target_object() when creating a new record.
It should update the object for any of get_fields() which appear to differ. However it need not bother with the get_keys() fields, since those will already be accurate.
- Parameters:
obj – Raw target object.
source_data – Dict of normalized data for source record.
target_data – Dict of normalized data for existing target record, if a typical update. Will be missing for a new object.
- Returns:
The final updated object. In most/all cases this will be the same instance as the original obj provided by the caller.
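A sketch of a typical override, for a target with a mix of simple and not-so-simple fields (field and attribute names are illustrative):

    def update_target_object(self, obj, source_data, target_data=None):
        fields = self.get_fields()
        # simple fields can be assigned directly
        if 'name' in fields:
            obj.name = source_data['name']
        # a field needing conversion before being stored (illustrative)
        if 'quantity' in fields:
            obj.quantity = float(source_data['quantity'])
        return obj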
- class wuttasync.importing.base.ToSqlalchemy(config, **kwargs)[source]¶
Base class for importer/exporter which uses SQLAlchemy ORM on the target side.
- caches_target = True¶
- get_target_objects(source_data=None, progress=None)[source]¶
Fetches target objects via the ORM query from get_target_query().
- get_target_query(source_data=None)[source]¶
Returns an ORM query suitable to fetch existing objects from the target side. This is called from get_target_objects().
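A sketch of an override which restricts the fetch, assuming the model_class has an active column and that the default query can be obtained via super() (illustrative filter):

    def get_target_query(self, source_data=None):
        model = self.model_class
        query = super().get_target_query(source_data=source_data)
        # only fetch active records from the target
        return query.filter(model.active.is_(True))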