4. Architecture¶

In mathematical terms, given:

a partially populated data tree, and
a set of functions operating (consuming/producing) on branches of the data tree,

graphtik collects a subset of functions in a graph that, when executed, consume & produce as many values as possible in the data-tree.
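
As a rough illustration of the statement above, the following toy sketch keeps calling every function whose inputs are already present in the data, until no more values can be produced. It is plain Python and not graphtik's implementation; `run_reachable` and the `(fn, needs, provides)` tuple layout are hypothetical names chosen for this example.

```python
def run_reachable(functions, data):
    """Run every function whose needs are available, until nothing changes."""
    data = dict(data)
    pending = list(functions)
    progress = True
    while progress:
        progress = False
        for spec in list(pending):
            fn, needs, provides = spec
            if all(name in data for name in needs):
                results = fn(*(data[name] for name in needs))
                data.update(zip(provides, results))
                pending.remove(spec)
                progress = True
    return data


functions = [
    (lambda a, b: (a + b,), ["a", "b"], ["ab"]),
    (lambda ab: (ab * 2,), ["ab"], ["ab2"]),
    (lambda z: (z,), ["missing"], ["never"]),  # its need is never available
]
result = run_reachable(functions, {"a": 1, "b": 2})
```

The third function is never called, because nothing provides `missing`; the real library decides this ahead of execution, during planning.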

compute¶

computation¶

phase¶

The definition & execution of pipelines happens in 3 phases:

composition
planning
execution

… it is constrained by these IO data-structures:

operation(s)
dependencies (needs & provides)
given inputs
asked outputs

… populates these low-level data-structures:

network (COMPOSE time)
execution dag (COMPILE time)
execution steps (COMPILE time)
solution (EXECUTE time)

… and utilizes these main classes:

`graphtik.fnop.FnOp`([fn, name, needs, ...])	An operation performing a callable (ie a function, a method, a lambda).
`graphtik.pipeline.Pipeline`(operations, name, *)	An operation that can compute a network-graph of operations.
`graphtik.planning.Network`(*operations[, graph])	A graph of operations that can compile an execution plan.
`graphtik.execution.ExecutionPlan`(net, needs, ...)	A pre-compiled list of operation steps that can execute for the given inputs/outputs.
`graphtik.execution.Solution`(plan, input_values)	The solution chain-map and execution state (e.g.

… plus those for plotting:

`graphtik.plot.Plotter`([theme])	a plotter renders diagram images of plottables.
`graphtik.plot.Theme`(*[, _prototype])	The poor man's css-like plot theme (see also `StyleStack`).

compose¶

composition¶

The phase where operations are constructed and grouped into pipelines and corresponding networks based on their dependencies.

Tip

Use operation() factory to construct FnOp instances (a.k.a. operations).
Use compose() factory to build Pipeline instances (a.k.a. pipelines).

recompute¶

There are 2 ways to feed the solution back into the same pipeline:

by reusing the pre-compiled plan (coarse-grained), or
by using the compute(recompute_from=...) argument (fine-grained),

as described in Re-computations tutorial section.

Attention

This feature is neither well implemented (e.g. test_recompute_NEEDS_FIX()) nor thoroughly tested.

combine pipelines¶

When operations and/or pipelines are composed together, there are two ways to combine the operations contained into the new pipeline: operation merging (default) and operation nesting.

They are selected by the nest parameter of compose() factory.

operation merging¶

The default method to combine pipelines, also applied when simply merging operations.

Any identically-named operations override each other, with the operations added earlier in the .compose() call (further to the left) winning over those added later (further to the right).

seealso: Merging
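
The "earlier wins" rule of operation merging can be pictured with a small stand-alone sketch. The `merge_operations` helper and the `(name, payload)` pairs are assumptions made for this example only; they are not part of graphtik.

```python
def merge_operations(*operations):
    """Keep the first operation seen for each name (those on the left win)."""
    merged = {}
    for name, payload in operations:
        merged.setdefault(name, payload)  # ignored if the name already exists
    return merged


ops = merge_operations(
    ("add", "first add"),
    ("mul", "mul"),
    ("add", "second add"),  # same name as the first: dropped
)
```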

operation nesting¶

The elaborate method to combine pipelines forming clusters.

The original pipelines are preserved intact in “isolated” clusters, by prefixing the names of their operations (and optionally data) by the name of the respective original pipeline that contained them (or the user defines the renames).

seealso: Nesting, compose(), RenArgs, nest_any_node(), dep_renamed(), PlotArgs.clusters, Hierarchical data and further tricks (example).

compile¶

compilation¶

planning¶

The phase where the Network creates a new execution plan by pruning all graph nodes into a subgraph dag, and deriving the execution steps.

execute¶

execution¶

sequential¶

The phase where the plan derived from a pipeline calls the underlying functions of all operations contained in its execution steps, with inputs/outputs taken/written to the solution.

Currently there are 2 ways to execute:

sequential
(deprecated) parallel, with a multiprocessing.pool.Pool

Plans may abort their execution by setting the abort run global flag.

network¶

graph¶

A Network.graph of operations linked by their dependencies implementing a pipeline.

During composition, the nodes of the graph are connected by repeated calls of Network._append_operation() within Network constructor.

During planning the graph is pruned based on the given inputs, outputs & node predicate to extract the dag, and it is ordered, to derive the execution steps, stored in a new plan, which is then cached on the Network class.

plan¶

execution plan¶

The ExecutionPlan class performs the execution phase, and contains the dag and the steps.

Compiled execution plans are cached in Network._cached_plans across runs with (inputs, outputs, predicate) as key.
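
A minimal sketch of such a cache, assuming the key is normalized so that the ordering of inputs and outputs does not matter (that normalization, the `ToyNetwork` class and its `compilations` counter are hypothetical, introduced only for illustration):

```python
class ToyNetwork:
    """Hypothetical stand-in for a network that caches compiled plans."""

    def __init__(self):
        self._cached_plans = {}
        self.compilations = 0  # counts the cache misses

    def compile(self, inputs, outputs=(), predicate=None):
        # Normalize the arguments so that ordering does not affect the key.
        key = (tuple(sorted(inputs)), tuple(sorted(outputs)), predicate)
        if key not in self._cached_plans:
            self.compilations += 1
            # A real plan would hold the pruned dag and the steps.
            self._cached_plans[key] = {"inputs": key[0], "outputs": key[1]}
        return self._cached_plans[key]


net = ToyNetwork()
plan1 = net.compile(["a", "b"], ["c"])
plan2 = net.compile(["b", "a"], ["c"])  # same key, served from the cache
plan3 = net.compile(["a", "b"], ["d"])  # different outputs, new plan
```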

solution¶

A map of dependency-named values fed to/from the pipeline during execution.

It feeds operations with inputs, collects their outputs, records the status of executed or canceled operations, tracks any overwrites, and applies any evictions, as orchestrated by the plan.

A new Solution instance is created either internally by Pipeline.compute() and populated with user-inputs, or must be created externally with those values and fed into the said method.

The results of the last operation executed “win” in the layers, and the base (least precedence) is the user-inputs given when the execution started.

Certain values may be extracted/populated with accessors.

layer¶

solution layer¶

The solution class inherits ChainMap, to store the actual outputs of each executed operation in a separate dictionary (+1 for user-inputs).

When layers are disabled, the solution populates the passed-in inputs and stores in layers just the keys of outputs produced.

By default, layering is disabled if there is no jsonp dependency in the network, the set_layered_solution() configuration has not been set, and the respective parameter has not been given to methods compute()/execute().

If disabled, overwrites are lost, but are marked as such.
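
The layering idea can be tried out with the standard library's `collections.ChainMap` alone. This is only a sketch of the data structure; the real Solution class adds execution state on top of it.

```python
from collections import ChainMap

user_inputs = {"a": 1, "b": 2}
solution = ChainMap(user_inputs)  # base layer: the user inputs

# new_child() pushes a mapping in FRONT of the chain,
# so the layer of the operation executed last is searched first.
solution = solution.new_child({"c": 3})           # layer of a 1st operation
solution = solution.new_child({"c": 30, "d": 4})  # layer of a 2nd operation

latest_c = solution["c"]          # found in the newest layer
layer_count = len(solution.maps)  # one layer per operation, +1 for the inputs
```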

Hint

Combining hierarchical data with per-operation layers in solution leads to duplications of container nodes in the data tree. To retrieve the complete solution, merging of overwritten nodes across the layers would then be needed.

overwrite¶

Solution values written by more than one operation in the respective layers, accessed through the Solution.overwrites attribute (assuming that layers have not been disabled, e.g. due to hierarchical data, in which case just the dependency names of the outputs actually produced are stored).

Note that sideffected outputs always produce an overwrite.

Overwrites will not work for evicted outputs.
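
Assuming a layered solution modeled as a plain `ChainMap`, overwrites could be collected roughly as follows. `find_overwrites` is a hypothetical helper for this example, not the library's property.

```python
from collections import ChainMap


def find_overwrites(layers):
    """Map each name stored in several layers to its values, newest first."""
    values = {}
    for layer in layers.maps:  # ChainMap.maps is ordered newest to oldest
        for name, value in layer.items():
            values.setdefault(name, []).append(value)
    return {name: vals for name, vals in values.items() if len(vals) > 1}


# Newest layer first: 2nd operation, 1st operation, user inputs.
solution = ChainMap({"c": 30, "d": 4}, {"c": 3}, {"a": 1, "b": 2})
overwritten = find_overwrites(solution)
```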

prune¶

pruning¶

A subphase of planning performed by method Network._prune_graph(), which extracts a subgraph dag that does not contain any unsatisfied operations.

It topologically sorts the graph, and prunes based on given inputs, asked outputs, node predicate and operation needs & provides.

unsatisfied operation¶

The core of pruning & rescheduling, performed by planning.unsatisfied_operations() function, which collects all operations with unreachable dependencies:

they have needs that do not correspond to any of the given inputs or the intermediately computed outputs of the solution;
all their provides are NOT needed by any other operation, nor are asked as outputs.
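
The two conditions can be sketched as a fixed-point loop over a toy description of the operations. This is an illustration under simplifying assumptions (explicit outputs are always asked, no modifiers, no node predicate); `find_unsatisfied` and the `{name: (needs, provides)}` layout are hypothetical and do not mirror the actual function.

```python
def find_unsatisfied(ops, inputs, outputs):
    """ops: {name: (needs, provides)}; returns the names to be pruned."""
    ops = dict(ops)
    unsatisfied = set()
    changed = True
    while changed:  # repeat, since pruning one operation may starve others
        changed = False
        available = set(inputs).union(*(p for _, p in ops.values()))
        wanted = set(outputs).union(*(n for n, _ in ops.values()))
        for name, (needs, provides) in list(ops.items()):
            missing_needs = not set(needs) <= available
            useless = not set(provides) & wanted
            if missing_needs or useless:
                del ops[name]
                unsatisfied.add(name)
                changed = True
    return unsatisfied


ops = {
    "op1": (["a"], ["b"]),
    "op2": (["b"], ["c"]),
    "op3": (["x"], ["d"]),       # "x" is never available
    "op4": (["d"], ["e"]),       # nobody asks for "e"
    "op5": (["a"], ["unused"]),  # nobody asks for "unused"
}
pruned = find_unsatisfied(ops, inputs=["a"], outputs=["c"])
```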

dag¶

execution dag¶

solution dag¶

There are 2 directed-acyclic-graph instances used:

the ExecutionPlan.dag, in the execution plan, which contains the pruned nodes, used to decide the execution steps;
the Solution.dag in the solution, which derives the canceled operations due to rescheduled/failed operations upstream.

steps¶

execution steps¶

The plan contains a list of the operation-nodes only from the dag, topologically sorted, and interspersed with instruction steps needed to compute the asked outputs from the given inputs.

They are built by Network._build_execution_steps() based on the subgraph dag.

The only instruction step other than an operation is for performing an eviction.

eviction¶

A memory footprint optimization where intermediate inputs & outputs are erased from solution as soon as they are not needed further down the dag.

Evictions are pre-calculated during planning, denoted with the dependency inserted in the steps of the execution plan.

Evictions inhibit overwrites.
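
To make the interplay of steps and evictions concrete, here is a sketch that orders toy operations topologically with the standard `graphlib` module (Python 3.9+) and inserts an eviction instruction right after the last consumer of each dependency. The `build_steps` name, the `("evict", name)` tuples and the input layout are assumptions of this example, not the library's internals.

```python
from graphlib import TopologicalSorter


def build_steps(ops, outputs):
    """ops: {name: (needs, provides)}; returns ops mixed with evictions."""
    producers = {p: name for name, (_, provides) in ops.items() for p in provides}
    # Each operation depends on the producers of its needs.
    graph = {
        name: {producers[n] for n in needs if n in producers}
        for name, (needs, _) in ops.items()
    }
    ordered = list(TopologicalSorter(graph).static_order())

    # Remember the index of the last operation reading each dependency.
    last_use = {}
    for i, name in enumerate(ordered):
        for need in ops[name][0]:
            last_use[need] = i

    steps = []
    for i, name in enumerate(ordered):
        steps.append(name)
        for dep, last in last_use.items():
            if last == i and dep not in outputs:
                steps.append(("evict", dep))  # not needed further down
    return steps


ops = {"op1": (["a"], ["b"]), "op2": (["b"], ["c"])}
steps = build_steps(ops, outputs=["c"])
```

Asked outputs are never evicted in this sketch, which is why `c` survives until the end.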

inputs¶

The named input values that are fed into an operation (or pipeline) through Operation.compute() method according to its needs.

These values are either:

given by the user to the outer pipeline, at the start of a computation, or
derived from solution using needs as keys, during intermediate execution.

outputs¶

The dictionary of computed values returned by an operation (or a pipeline) matching its provides, when method Operation.compute() is called.

Those values are either:

retained in the solution, internally during execution, keyed by the respective provide, or
returned to user after the outer pipeline has finished computation.

When no specific outputs are requested from a pipeline, Pipeline.compute() returns all intermediate inputs along with the outputs, that is, no evictions happen.

An operation may return partial outputs.

pipeline¶

The Pipeline composes and computes a network of operations against given inputs & outputs.

This class is also an operation, so it specifies needs & provides but these are not fixed, in the sense that Pipeline.compute() can potentially consume and provide different subsets of inputs/outputs.

operation¶

Either the abstract notion of an action with specified needs and provides (its dependencies), or the concrete wrapper FnOp for any callable(), that feeds on inputs and updates outputs, from/to the solution, or given-by/returned-to the user by a pipeline.

The distinction between needs/provides and inputs/outputs is akin to function parameters and arguments during define-time and run-time, respectively.

dependency¶

The (possibly hierarchical) name of a solution value an operation needs or provides.

Dependencies are declared during composition, when building FnOp instances. Operations are then interlinked together, by matching the needs & provides of all operations contained in a pipeline.
During planning the graph is then pruned based on the reachability of the dependencies.
During execution Operation.compute() performs 2 “matchings”:
- inputs & outputs in solution are accessed by the needs & provides names of the operations;
- operation needs & provides are zipped against the underlying function’s arguments and results.
These matchings are affected by modifiers, which are printed out with diacritics.
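
The two "matchings" can be sketched without any modifiers as follows. `toy_compute` is a hypothetical helper, and treating a single provide as a non-sequence result is an assumption of this example.

```python
def toy_compute(fn, needs, provides, solution):
    """Feed fn from the solution by needs, store its results by provides."""
    # Matching 1: solution values are looked up by the needs names...
    args = [solution[name] for name in needs]
    # ...and handed over positionally to the function's arguments.
    results = fn(*args)
    # Matching 2: the provides names are zipped with the function's results.
    if len(provides) == 1:
        results = [results]
    outputs = dict(zip(provides, results))
    solution.update(outputs)
    return outputs


solution = {"x": 7, "y": 2}
outputs = toy_compute(divmod, ["x", "y"], ["quot", "rem"], solution)
```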

Differences between various dependency operation attributes:

dependency attribute        dupes   token   alias   sfxed
-------------------------   -----   -----   -----   --------
needs      needs            ✗       ✓               SINGULAR
           _user_needs      ✓       ✓
           _fn_needs        ✓       ✗               STRIPPED
provides   provides         ✗       ✓       ✓       SINGULAR
           _user_provides   ✓       ✓       ✗
           _fn_provides     ✓       ✗       ✗       STRIPPED

where:

“dupes=no” means the collection drops any duplicated dependencies

“SINGULAR” means sfxed('A', 'a', 'b') ==> sfxed('A', 'a'), sfxed('A', 'b')

“STRIPPED” means sfxed('A', 'a', 'b') ==> token('a'), sfxed('b')

needs¶

fn_needs¶

matching inputs¶

The list of dependency names an operation requires from solution as inputs, roughly corresponding to underlying function’s arguments (fn_needs).

Specifically, Operation.compute() extracts input values from solution by these names, and matches them against function arguments, mostly by their positional order. Whenever this matching is not 1-to-1, and function-arguments differ from the regular needs, modifiers must be used.

provides¶

user_provides¶

fn_provides¶

zipping outputs¶

The list of dependency names an operation writes to the solution as outputs, roughly corresponding to underlying function’s results (fn_provides).

Specifically, Operation.compute() “zips” this list-of-names with the output values produced when the operation’s function is called. You may alter this “zipping” by one of the following methods:

artificially extend the provides with aliased fn_provides,
use modifiers to annotate certain names with keyword(), tokens and/or implicit, or
mark that the operation's function returns dictionary, to cancel zipping.
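
Two of these alternatives, aliases and dictionary results, can be pictured with a small stand-alone helper. `zip_outputs` and its parameters are hypothetical names for this sketch; the library's own logic also handles modifiers, which are left out here.

```python
def zip_outputs(fn_provides, results, returns_dict=False, aliases=()):
    """Pair function results with names, then duplicate any aliased names."""
    if returns_dict:
        outputs = dict(results)  # the function already named its results
    else:
        if len(fn_provides) == 1:
            results = [results]
        outputs = dict(zip(fn_provides, results))
    for source, alias in aliases:
        if source in outputs:
            outputs[alias] = outputs[source]
    return outputs


zipped = zip_outputs(["a", "b"], (1, 2))
partial = zip_outputs(["a", "b"], {"a": 1}, returns_dict=True)
aliased = zip_outputs(["a"], 5, aliases=[("a", "A")])
```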

Note

When joining a pipeline this must not be empty, or it will scream! (an operation without provides would always be pruned)

alias¶

Map an existing name in fn_provides into a duplicate, artificial one in provides.

You cannot alias an alias. See Interface differently named dependencies: aliases & keyword modifier

conveyor operation¶

default identity function¶

The default function if none given to an operation that conveys needs to provides.

For this to happen when FnOp.compute() is called, an operation name must have been given AND the number of provides must match that of the number of needs.

seealso: Default conveyor operation & identity_function().
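
A conveyor can be approximated in a few lines. The `identity` and `convey` functions below are written for this example only (the library ships its own identity_function(), whose exact behavior is not reproduced here).

```python
def identity(*args):
    """Return the single argument as-is, or all of them as a tuple."""
    return args[0] if len(args) == 1 else args


def convey(needs, provides, solution):
    """Copy the values named by needs under the names in provides."""
    if len(needs) != len(provides):
        raise ValueError("a conveyor requires as many provides as needs")
    results = identity(*(solution[name] for name in needs))
    if len(provides) == 1:
        results = (results,)
    return dict(zip(provides, results))


renamed = convey(["a", "b"], ["A", "B"], {"a": 1, "b": 2})
single = convey(["a"], ["A"], {"a": [1, 2]})
```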

returns dictionary¶

When an operation is marked with FnOp.returns_dict flag, the underlying function is not expected to return fn_provides as a sequence but as a dictionary; hence, no “zipping” of function-results –> fn_provides takes place.

Useful for operations returning partial outputs to have full control over which outputs were actually produced, or to cancel tokens.

modifier¶

diacritic¶

Dependency annotations modifying their behavior during planning, execution, and when binding them as the inputs & outputs of operations.

The needs and provides may be annotated:

as keyword() and/or optionals, to modify their binding with the underlying function’s arguments,
as tokens, to let operations form arbitrary chains,
as sideffected, to operate on the same data more than once (something that a DAG does not allow),
as implicit, to let functions do the actual read/write in solution,
as accessor, to modularize the access of solution data (only jsonp implemented so far),

the last two often working together to manipulate hierarchical data.

The representation of modifier-annotated dependencies utilizes a combination of these diacritics:

>   : keyword()
?   : optional()
*   : vararg()
+   : varargs()
@   : accessor (mostly for jsonp)
$   : token()
^   : implicit()

See graphtik.modifier module.

optionals¶

A needs-only modifier for inputs that do not hinder operation execution (prune) if absent from solution.

In the underlying function it may correspond to either:

non-compulsory function arguments (with defaults), annotated with optional(), or
varargish arguments, annotated with vararg() or varargs().

varargish¶

A needs only modifier for inputs to be appended as *args (if present in solution).

There are 2 kinds, both, by definition, optionals:

the vararg() annotates any solution value to be appended once in the *args;
the varargs() annotates iterable values, all items of which are appended in the *args one-by-one.

Attention

To avoid user mistakes, varargs do not accept str inputs (though they are iterables):

>>> graph(a=5, b="mistake")
Traceback (most recent call last):
ValueError: Failed matching inputs <=> needs for FnOp(name='enlist',
            needs=['a', 'b'(+)],
            provides=['sum'],
            fn='enlist'):
    1. Expected varargs inputs to be non-str iterables: {'b'(+): 'mistake'}
    +++inputs: ['a', 'b']

In printouts, it is denoted either with * or + diacritic.

See also the elaborate example in Hierarchical data and further tricks section.
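
The difference between the two varargish kinds, and the str guard mentioned above, can be sketched like this. `collect_args` and the `(name, kind)` pairs are hypothetical conventions for this example; the real modifiers are created with vararg() and varargs().

```python
def collect_args(solution, needs):
    """needs: [(name, kind)], kind in 'positional', 'vararg' or 'varargs'."""
    args = []
    for name, kind in needs:
        if kind == "positional":
            args.append(solution[name])  # compulsory: KeyError if absent
        elif name not in solution:
            continue                     # varargish needs are optional
        elif kind == "vararg":
            args.append(solution[name])  # appended once, as-is
        else:  # "varargs"
            value = solution[name]
            if isinstance(value, str):
                raise ValueError(f"expected a non-str iterable for {name!r}")
            args.extend(value)           # appended item by item
    return args


needs = [("a", "positional"), ("b", "varargs")]
expanded = collect_args({"a": 5, "b": [1, 2]}, needs)
without_b = collect_args({"a": 5}, needs)
try:
    collect_args({"a": 5, "b": "mistake"}, needs)
except ValueError as ex:
    error = str(ex)
```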

implicit¶

A modifier denoting a dependency not fed into/out of the function, but the dependency is still considered while planning, expected to exist in the solution, downstream.

One use case is for an operation to consume/produce a subdoc(s) with its own means (not through jsonp accessors).

Constructed with the implicit() modifier function, they can also be optionals and jsonp (but without accessors). If an implicit cannot solve your problems, try sideffected or tokens…

tokens¶

sideffects¶

A modifier denoting a fictive dependency linking operations into virtual flows, without real data exchanges.

The side-effect modification may happen to some internal state not fully represented in the graph & solution.

There are actually 2 relevant modifiers:

The token() modifier, describing modifications taking place beyond the scope of the solution; it can connect operations arbitrarily, irrespective of data exchanges. It may have just the “optional” diacritic in printouts.

Tip

Probably you either need implicit, or the next variant, not this one.

The sideffected modifier (annotated with sfxed()) denoting modifications on a real dependency read from and written to the solution.

Both kinds of sideffects participate in the planning of the graph, and both may be given or asked in the inputs & outputs of a pipeline, but they are never given to functions. A function of a returns dictionary operation can return a falsy value to declare it as canceled.

sideffected¶

sfx_list¶

A modifier denoting (sfx_list) sideffects acting on a solution dependency.

Note

To be precise, the “sideffected dependency” is the name held in _Modifier._sideffected attribute of a modifier created by sfxed() function; it may have all diacritics in printouts.

The main use case is to declare an operation that both needs and provides the same dependency, to mutate it. When designing a network with many sfxed modifiers all based on the same sideffected dependency (i.e. with different sfx_list), then these should form a strict (no forks) sequence, or else, fork modifications will be lost.

The outputs of a sideffected dependency will produce an overwrite if the sideffected dependency is declared both as needs and provides of some operation.

See also the elaborate example in Hierarchical data and further tricks section.

accessor¶

Getter/setter functions to extract/populate solution values given as a modifier parameter (not applicable for tokens & implicit).

See Accessor defining class and the modify() concrete factory.

subdoc¶

superdoc¶

doc chain¶

data tree¶

hierarchical data¶

A subdoc is a dependency value nested further into another one (the superdoc), accessed with a json pointer path expression with respect to the solution, denoted with slashes like: root/parent/child/leaf

Whenever a nested dependency is given/asked, then all docs-in-chain (depicted below) are topologically sorted, before executing any operations working on them.

The docs-in-chain for a hypothetical dependency stats/b/b1: superdocs at the left, subdocs at the right of b1, respectively.¶

Note that if the root has been asked in outputs, none of its subdocs will be evicted.

seealso: Hierarchical data and further tricks (example)

json pointer path¶

jsonp¶

A dependency containing slashes (/) & accessors that can read and write subdoc values with json pointer expressions, like root/parent/child/1/item, resolved from solution.

In addition to writing values, the vcat() or hcat() modifiers (& respective accessors) support also pandas concatenation for provides.

Note that all non-root dependencies are implicitly created as jsonp if the operation has a current-working-document defined.
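
How a slash-separated path could be resolved against nested containers is sketched below. These two helpers are assumptions for illustration (they ignore escaping, rooted paths and pandas objects), not the library's accessors.

```python
def resolve_jsonp(doc, path):
    """Walk nested dicts/lists following the slash-separated steps."""
    for step in path.split("/"):
        if isinstance(doc, (list, tuple)):
            step = int(step)  # sequence items are addressed by index
        doc = doc[step]
    return doc


def set_jsonp(doc, path, value):
    """Write a value at the path, creating intermediate dicts as needed."""
    *parents, leaf = path.split("/")
    for step in parents:
        if isinstance(doc, list):
            doc = doc[int(step)]
        else:
            doc = doc.setdefault(step, {})
    if isinstance(doc, list):
        doc[int(leaf)] = value
    else:
        doc[leaf] = value


solution = {"root": {"parent": {"child": [{"item": 0}, {"item": 42}]}}}
item = resolve_jsonp(solution, "root/parent/child/1/item")
set_jsonp(solution, "stats/b/b1", 7)
```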

cwd¶

current-working-document¶

A jsonp prefix of an operation (or pipeline) to prefix any non-root dependency defined.

pandas concatenation¶

A jsonp dependency in provides may designate its respective DataFrame and/or Series output value to be concatenated with existing Pandas objects in the solution (useful when working with Pandas advanced indexing; otherwise, sideffecteds are needed to break read-update cycles on dataframes).

See example in Concatenating Pandas.

reschedule¶

rescheduling¶

partial outputs¶

canceled operation¶

The partial pruning of the solution’s dag during execution. It happens when any of these 2 conditions apply:

an operation is marked with the FnOp.rescheduled attribute, which means that its underlying callable may produce only a subset of its provides (partial outputs);
endurance is enabled, either globally (in the configurations), or for a specific operation.

The solution must then reschedule the remaining operations downstream, and possibly cancel some of those (assigned in Solution.canceled).

Partial operations are usually declared with returns dictionary so that the underlying function can control which of the outputs are returned.

See Operations with partial outputs (rescheduled)
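
A simplified picture of rescheduling: once it is known which values were actually produced, the remaining operations are re-examined in order, and those whose needs can no longer be met are canceled. `canceled_operations` and the tuple layout are hypothetical; the real solution works on its dag instead.

```python
def canceled_operations(remaining_ops, available):
    """remaining_ops: [(name, needs, provides)] in execution order."""
    available = set(available)
    canceled = []
    for name, needs, provides in remaining_ops:
        if set(needs) <= available:
            available.update(provides)  # it will run, so count its outputs
        else:
            canceled.append(name)       # some need can never arrive
    return canceled


# An upstream operation declared provides "b" and "c", but returned only "b".
downstream = [
    ("op2", ["b"], ["e"]),
    ("op3", ["c"], ["d"]),
    ("op4", ["d", "e"], ["f"]),
]
canceled = canceled_operations(downstream, available={"a", "b"})
```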

endurance¶

endured¶

Keep executing as many operations as possible, even if some of them fail. Endurance for an operation is enabled if set_endure_operations() is true globally in the configurations or if FnOp.endured is true.

You may interrogate Solution.executed to discover the status of each executed operation, or call one of check_if_incomplete() or scream_if_incomplete().

See Depending on sideffects
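
Endurance boils down to catching failures per operation and carrying on with whatever can still run. In the sketch below, the `executed` mapping stores `None` for success and the exception for failure; that convention, like `execute_enduring` itself, is an assumption of this example.

```python
def execute_enduring(steps, solution):
    """steps: [(name, fn, needs, provide)]; returns {name: None | exception}."""
    executed = {}
    for name, fn, needs, provide in steps:
        if not all(n in solution for n in needs):
            continue  # canceled: an upstream failure left a need unsatisfied
        try:
            solution[provide] = fn(*(solution[n] for n in needs))
            executed[name] = None
        except Exception as ex:
            executed[name] = ex  # record the failure and keep going
    return executed


steps = [
    ("ok", lambda a: a + 1, ["a"], "b"),
    ("boom", lambda a: a / 0, ["a"], "c"),
    ("skipped", lambda c: c, ["c"], "d"),
    ("also_ok", lambda b: b * 2, ["b"], "e"),
]
solution = {"a": 1}
executed = execute_enduring(steps, solution)
```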

predicate¶

node predicate¶

A callable(op, node-data) that should return true for nodes to be included in graph during planning.

abort run¶

A global configurations flag that, when set with the abort_run() function, halts the execution of all current or future plans.

It is reset automatically on every call of Pipeline.compute() (after a successful intermediate planning), or manually, by calling reset_abort().
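
The mechanism can be imitated with a context variable that is checked between steps. The functions below merely borrow the names abort_run() and reset_abort() for readability; they are toy re-implementations, and `AbortedException` and `execute` are hypothetical.

```python
import contextvars

_abort = contextvars.ContextVar("abort", default=False)


class AbortedException(Exception):
    """Raised with the results collected before the abort was noticed."""


def abort_run():
    _abort.set(True)


def reset_abort():
    _abort.set(False)


def execute(steps):
    done = []
    for step in steps:
        if _abort.get():  # checked before each step
            raise AbortedException(done)
        done.append(step())
    return done


steps = [lambda: "one", lambda: (abort_run(), "two")[1], lambda: "three"]
try:
    execute(steps)
except AbortedException as ex:
    partial = ex.args[0]
reset_abort()
rerun = execute([lambda: "fresh"])
```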

parallel¶

parallel execution¶

execution pool¶

task¶

Attention

Deprecated, in favor of always producing a list of “parallelizable batches”, to hook with other executors (e.g. Dask, Apache’s airflow, Celery). In the future, just the single-process implementation will be kept, and marshalling should be handled externally.

Execute operations in parallel, with a thread pool or process pool (instead of sequential). Operations and pipelines are marked as such on construction, or enabled globally from configurations.

Note that tokens are not expected to function with process pools, certainly not when marshalling is enabled.

process pool¶

When the multiprocessing.pool.Pool class is used for (deprecated) parallel execution, the tasks must be communicated to/from the worker process, which requires pickling, and that may fail. With pickling failures you may try marshalling with dill library, and see if that helps.

Note that tokens are not expected to function at all, certainly not when marshalling is enabled.

thread pool¶

When the multiprocessing.dummy.Pool() class is used for (deprecated) parallel execution, the tasks are run in-process, so no marshalling is needed.

marshalling¶

(deprecated) Pickling parallel operations and their inputs/outputs using the dill module. It is configured either globally with set_marshal_tasks() or set with a flag on each operation / pipeline.

Note that tokens do not work when this is enabled.

plottable¶

Objects that can plot their graph network, such as those inheriting Plottable, (FnOp, Pipeline, Network, ExecutionPlan, Solution) or a pydot.Dot instance (the result of the Plottable.plot() method).

Such objects may render as SVG in Jupyter notebooks (through their plot() method) and can render in a Sphinx site with the graphtik RsT directive. You may control the rendered image as explained in the tip of the Plotting section.

SVGs are rendered with the zoom-and-pan javascript library.

Attention

Zoom-and-pan does not work in Sphinx sites for Chrome locally - serve the HTML files through some HTTP server, e.g. launch this command to view the site of this project:
python -m http.server 8080 --directory build/sphinx/html/

plotter¶

plotting¶

A Plotter is responsible for rendering plottables as images. It is the active plotter that does that, unless overridden in a Plottable.plot() call. Plotters can be customized by various means, such as the plot theme.

active plotter¶

default active plotter¶

The plotter currently installed “in-context” of the respective graphtik configuration - this term implies also any Plot customizations done on the active plotter (such as plot theme).

Installation happens by calling one of active_plotter_plugged() or set_active_plotter() functions.

The default active plotter is the plotter instance that this project comes pre-configured with, ie, when no plot-customizations have yet happened.

Attention

It is recommended to use other means for Plot customizations instead of modifying directly theme’s class-attributes.

All Theme class-attributes are deep-copied when constructing new instances, to avoid modifying them by mistake while attempting to update instance-attributes instead (hint: almost all its attributes are containers, i.e. dicts). Therefore any class-attribute modification will be ignored until a new Theme instance from the patched class is used.
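
The deep-copy behavior described above can be reproduced with a tiny class. `ToyTheme` and its single `fill_color` attribute are hypothetical and only mimic the described mechanism.

```python
import copy


class ToyTheme:
    fill_color = {"default": "wheat"}  # a container class-attribute

    def __init__(self, **overrides):
        # Copy every public, non-callable class attribute onto the instance.
        for name, value in vars(type(self)).items():
            if not name.startswith("_") and not callable(value):
                setattr(self, name, copy.deepcopy(value))
        vars(self).update(overrides)


theme = ToyTheme()
theme.fill_color["default"] = "red"       # touches the instance copy only
class_value = ToyTheme.fill_color["default"]

ToyTheme.fill_color["default"] = "blue"   # patching the class...
old_instance_value = theme.fill_color["default"]        # ...is not seen here
new_instance_value = ToyTheme().fill_color["default"]   # ...but is seen here
```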

plot theme¶

current theme¶

The mergeable and expandable styles contained in a plot.Theme instance.

The current theme in-use is the Plotter.default_theme attribute of the active plotter, unless overridden with the theme parameter when calling Plottable.plot() (conveyed internally as the value of the PlotArgs.theme attribute).

style¶

style expansion¶

A style is an attribute of a plot theme, either a scalar value or a dictionary.

Styles are collected in stacks and are merged into a single dictionary after performing the following expansions:

Call any callables found as keys, values or the whole style-dict, passing in the current plot_args, and replace those with the callable’s result (even more flexible than templates).

Resolve any Ref instances, first against the current nx_attrs and then against the attributes of the current theme.

Render jinja2 templates with template-arguments all attributes of plot_args instance in use, (hence much more flexible than Ref).

Any None results above are discarded.

Work around pydot/pydot#228 (pydot-cstor not supporting styles-as-lists).

Merge tooltip & tooltip lists.
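
Only the callable-expansion, the discarding of None results and the final merge are sketched here; Ref resolution, jinja2 templates and the pydot workaround are left out. The helper names are hypothetical, `plot_args` is modeled as a plain dict, and "later styles override earlier ones" is an assumption of this example.

```python
def expand_style(style, plot_args):
    """Call any callables found as the whole style, its keys or its values."""
    if callable(style):
        style = style(plot_args)
    expanded = {}
    for key, value in (style or {}).items():
        if callable(key):
            key = key(plot_args)
        if callable(value):
            value = value(plot_args)
        if key is not None and value is not None:  # None results are dropped
            expanded[key] = value
    return expanded


def merge_styles(stack, plot_args):
    """Merge a stack of styles into a single dictionary."""
    merged = {}
    for style in stack:
        merged.update(expand_style(style, plot_args))
    return merged


stack = [
    {"color": "red", "shape": "box"},
    {"color": lambda pa: "blue" if pa["failed"] else None},
    lambda pa: {"label": pa["name"]},
]
normal = merge_styles(stack, {"failed": False, "name": "op1"})
failed = merge_styles(stack, {"failed": True, "name": "op1"})
```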

Tip

If the DEBUG flag is enabled, the provenance of all style values appears in the tooltips of plotted graphs.

configurations¶

graphtik configuration¶

The functions controlling compile & execution globally are defined in the config module (plus one in the graphtik.plot module); the underlying global data are stored in contextvars.ContextVar instances, to allow for nested control.

All boolean configuration flags are tri-state (None, False, True), allowing to “force” all operations, when they are not set to the None value. All of them default to None (false).

callbacks¶

Two optional callables called before/after each operation during Pipeline.compute(). Attention, any errors will abort the pipeline execution.

pre-op-callback: Called from solution code before marshalling. A use case would be to validate solution, or trigger a breakpoint by some condition.
post-op-callback: Called after the solution has been populated with operation results. A use case would be to validate operation outputs and/or solution after results have been populated.

Callbacks must have this signature:

callbacks(op_cb) -> None

… where op_cb is an instance of the OpTask.
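
The calling convention can be pictured with a stand-in for the task object. `ToyOpTask`, its fields and `run_with_callbacks` are hypothetical; the attributes of the real OpTask are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyOpTask:
    name: str
    solution: dict
    result: Optional[dict] = None


def run_with_callbacks(name, fn, solution, callbacks=()):
    """callbacks: up to two callables, (pre_op, post_op)."""
    pre_cb, post_cb = (tuple(callbacks) + (None, None))[:2]
    task = ToyOpTask(name, solution)
    if pre_cb:
        pre_cb(task)   # e.g. validate the solution before the call
    task.result = fn(solution)
    solution.update(task.result)
    if post_cb:
        post_cb(task)  # e.g. validate the outputs just stored
    return task


log = []
task = run_with_callbacks(
    "add",
    lambda sol: {"c": sol["a"] + sol["b"]},
    {"a": 1, "b": 2},
    callbacks=(
        lambda t: log.append(("pre", t.name, t.result)),
        lambda t: log.append(("post", t.name, t.result)),
    ),
)
```

Since the callbacks are called inline, an exception raised inside either of them propagates and stops the run, in line with the warning above.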

jetsam¶

When a pipeline or an operation fails, the original exception gets annotated with salvaged values from locals() and raised intact, and optionally (if DEBUG flag) the diagram of the failed plottable is saved in a temporary file.

See Jetsam on exceptions.
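
The idea of annotating and re-raising the same exception is sketched below. Storing the salvaged values in a `jetsam` attribute, and the keys used, are assumptions of this example; `run_op` is a hypothetical helper.

```python
def run_op(name, fn, inputs):
    """Call fn; on failure, attach context to the exception and re-raise it."""
    try:
        return fn(**inputs)
    except Exception as ex:
        # Annotate the ORIGINAL exception, so its type and traceback survive.
        ex.jetsam = {"operation": name, "inputs": dict(inputs)}
        raise


try:
    run_op("divide", lambda a, b: a / b, {"a": 1, "b": 0})
except ZeroDivisionError as ex:
    salvaged = ex.jetsam
```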