API
The API is a single function, ingest, together with classes of string constants: HighWatermark, Visibility, Upsert and Delete. The constants are known strings rather than opaque identifiers so that they can easily be passed from dynamic/non-Python environments.
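Because the constants are plain strings, a value decoded from a configuration file or received from another language can be compared with them directly. The following is a minimal sketch of that idea: the two classes are stand-ins that mirror the string values documented for each constant, not the library's own definitions, and the JSON settings are hypothetical.

```python
import json

# Stand-in classes for illustration only. The values mirror the strings
# documented for each constant, but this is not the library's source.
class HighWatermark:
    LATEST = '__LATEST__'
    EARLIEST = '__EARLIEST__'

class Delete:
    OFF = '__OFF__'
    BEFORE_FIRST_BATCH = '__BEFORE_FIRST_BATCH__'

# Hypothetical settings produced by a non-Python system, e.g. a JSON config
raw_config = '{"high_watermark": "__EARLIEST__", "delete": "__BEFORE_FIRST_BATCH__"}'
config = json.loads(raw_config)

# The decoded strings compare equal to the constants, so they could be
# passed straight through as the corresponding keyword arguments
is_full_reingest = (
    config['high_watermark'] == HighWatermark.EARLIEST
    and config['delete'] == Delete.BEFORE_FIRST_BATCH
)
print(is_full_reingest)
```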
ingest(conn, metadata, batches, on_before_visible=lambda conn, latest_batch_metadata: None, high_watermark=HighWatermark.LATEST, visibility=Visibility.AFTER_EACH_BATCH, upsert=Upsert.IF_PRIMARY_KEY, delete=Delete.OFF, max_rows_per_table_buffer=10000)
Ingests data into one or more tables.
- conn - A SQLAlchemy connection not in a transaction, i.e. started by connection rather than begin.
- metadata - A SQLAlchemy metadata of one or more tables.
- batches - A function that takes a high watermark and returns an iterable that yields data batches that are strictly after this high watermark. See Usage above for an example.
- on_before_visible (optional) - A function that takes a SQLAlchemy connection in a transaction and batch metadata, called just before data becomes visible to other database clients. See Usage above for an example.
- high_watermark (optional) - A member of the HighWatermark class, or a JSON-encodable value.

  If this is HighWatermark.LATEST, then the most recent high watermark that has been returned from a previous ingest's batches function, and whose corresponding batch has been successfully ingested, is passed into the batches function. If there has been no previous ingest, None will be passed.

  If this is HighWatermark.EARLIEST, then None will be passed to the batches function as the high watermark. This would typically be used to re-ingest all of the data.

  If this is a JSON-encodable value other than HighWatermark.LATEST or HighWatermark.EARLIEST, then this value is passed directly to the batches function. This can be used to override any previous high watermark. Existing data in the target table is not deleted unless specified by the delete parameter.
- visibility (optional) - A member of the Visibility class, controlling when ingests will be visible to other clients.
- upsert (optional) - A member of the Upsert class, controlling whether an upsert is performed when ingesting data.
- delete (optional) - A member of the Delete class, controlling whether existing rows are deleted.
- max_rows_per_table_buffer (optional) - An integer number of rows to buffer in memory per table when ingesting into multiple tables.
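The contract of the batches argument, yielding only data strictly after the given high watermark, can be sketched with an in-memory data source. This is an illustration under assumptions rather than the library's required format: SOURCE_ROWS and BATCH_SIZE are hypothetical, and the shape of each yielded batch (new high watermark, batch metadata, rows) is assumed here; the Usage section is the authority on the exact shape.

```python
# Hypothetical source data: (id, value) pairs ordered by id
SOURCE_ROWS = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
BATCH_SIZE = 2

def batches(high_watermark):
    # None means there is no previous high watermark, so start from the beginning
    last_id = 0 if high_watermark is None else high_watermark
    # Keep only rows strictly after the high watermark
    remaining = [row for row in SOURCE_ROWS if row[0] > last_id]
    for start in range(0, len(remaining), BATCH_SIZE):
        batch = remaining[start:start + BATCH_SIZE]
        # Assumed shape: (new high watermark, batch metadata, rows)
        yield batch[-1][0], None, batch

first_run = list(batches(None))  # everything, in batches of two
resumed = list(batches(4))       # only rows with an id greater than 4
```

Using the id of the last row in each batch as the new high watermark means a later ingest can resume exactly where the previous one stopped.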
HighWatermark
A class of constants to indicate what high watermark should be passed into the batches function.
- LATEST - pass the most recent high watermark yielded from the batches function from the previous ingest into the batches function. If there is no previous ingest, the Python value None is passed. This is the string __LATEST__.
- EARLIEST - pass the Python value None into the batches function. This is the string __EARLIEST__.
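The rules for which value reaches the batches function can be written as a small pure function. This is only a model of the documented behaviour: resolve_high_watermark and stored_latest are hypothetical names, and how the library actually stores the previous high watermark is not described here.

```python
LATEST = '__LATEST__'
EARLIEST = '__EARLIEST__'

def resolve_high_watermark(high_watermark, stored_latest):
    """Return the value that would be passed to the batches function.

    stored_latest stands in for the most recent high watermark saved by a
    previous successful ingest, or None if there has been no previous ingest.
    """
    if high_watermark == LATEST:
        return stored_latest
    if high_watermark == EARLIEST:
        return None
    # Any other JSON-encodable value overrides the stored high watermark
    return high_watermark

resume_value = resolve_high_watermark(LATEST, 42)
restart_value = resolve_high_watermark(EARLIEST, 42)
override_value = resolve_high_watermark('2024-01-01', 42)
```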
Visibility
A class of constants to indicate when data changes are visible to other database clients. Schema changes become visible before the first batch.
- AFTER_EACH_BATCH - data changes are visible to other database clients after each batch. This is the string __AFTER_EACH_BATCH__.
Delete
A class of constants that controls how existing data in the table is deleted.
- OFF - There is no deleting of existing data. This is the string __OFF__.
- BEFORE_FIRST_BATCH - All existing data in the table is deleted just before the first batch is ingested. If there are no batches, no data is deleted. This is the string __BEFORE_FIRST_BATCH__.
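The difference between the two Delete modes, and in particular the rule that nothing is deleted when there are no batches, can be modelled with a plain list standing in for the table. The function apply_batches is hypothetical and is not how the library performs deletes against the database.

```python
OFF = '__OFF__'
BEFORE_FIRST_BATCH = '__BEFORE_FIRST_BATCH__'

def apply_batches(existing_rows, batches, delete=OFF):
    """Model of the documented behaviour, using a list in place of a table."""
    rows = list(existing_rows)
    first_batch = True
    for batch in batches:
        if first_batch and delete == BEFORE_FIRST_BATCH:
            # Existing data is removed only once a first batch arrives
            rows.clear()
        first_batch = False
        rows.extend(batch)
    return rows

replaced = apply_batches(['old'], [['new1'], ['new2']], delete=BEFORE_FIRST_BATCH)
untouched = apply_batches(['old'], [], delete=BEFORE_FIRST_BATCH)
appended = apply_batches(['old'], [['new']])
```

Deferring the delete until a batch is seen is what keeps the existing rows intact when the source yields nothing.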
Upsert
A class of constants that controls whether upserting is performed.
- OFF - No upserting is performed. This is the string __OFF__.
- IF_PRIMARY_KEY - If the table contains a primary key, an upsert based on that primary key is performed on ingest. If there is no primary key, then a plain insert is performed. This is useful to avoid duplication if batches overlap. This is the string __IF_PRIMARY_KEY__.
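The effect of a primary-key upsert on overlapping batches can be demonstrated with Python's built-in sqlite3 module. SQLite is used here purely as a stand-in database so that the example is self-contained; the table my_table and the upsert helper are hypothetical, and the SQL that the library itself issues is not shown in this section.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE my_table (id INTEGER PRIMARY KEY, value TEXT)')

def upsert(rows):
    # Insert each row; on a primary key clash, update the existing row instead.
    # The ON CONFLICT ... DO UPDATE clause needs SQLite 3.24 or later.
    conn.executemany(
        'INSERT INTO my_table (id, value) VALUES (?, ?) '
        'ON CONFLICT (id) DO UPDATE SET value = excluded.value',
        rows,
    )

upsert([(1, 'a'), (2, 'b')])
upsert([(2, 'b-updated'), (3, 'c')])  # overlaps the first batch on id 2

result = conn.execute('SELECT id, value FROM my_table ORDER BY id').fetchall()
print(result)
```

The row with id 2 appears once, carrying the value from the later batch, rather than being duplicated or causing the second batch to fail.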