Graphtik

(src: 10.1.0, git: v10.1.0, Aug 31, 2020)

It’s a DAG all the way down!

Lightweight computation graphs for Python

Graphtik is a library to design, plot & execute graphs of functions (a.k.a. pipelines) that consume and populate (possibly nested) data, based on whether values for those data (a.k.a. dependencies) exist.

The API aims for a fair compromise between features and complexity, without precluding any.
It can be used as is to build machine learning pipelines for data science projects.
It should be extendable to act as the core for a custom ETL engine, a workflow-processor for interdependent tasks & files like GNU Make, or a spreadsheet calculation engine.
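
The core idea above, running a function only once the values it depends on exist, can be sketched in plain Python. This is a toy illustration and not Graphtik's implementation; the `run_when_ready` helper and the `(function, needs, provides)` tuple layout are hypothetical names made up for this sketch.

```python
from operator import mul, sub


def run_when_ready(ops, inputs):
    """Toy scheduler: run each op once all of its needed values exist.

    `ops` is a list of (function, needs, provides) tuples; this layout is
    invented for the sketch and is not Graphtik's data model.
    """
    data = dict(inputs)
    pending = list(ops)
    progressed = True
    # Keep sweeping until nothing is pending or no op could run in a sweep.
    while pending and progressed:
        progressed = False
        for op in list(pending):
            fn, needs, provides = op
            if all(name in data for name in needs):
                data[provides] = fn(*(data[name] for name in needs))
                pending.remove(op)
                progressed = True
    return data


ops = [
    (sub, ["a", "ab"], "a_minus_ab"),  # listed first, but must wait for "ab"
    (mul, ["a", "b"], "ab"),
]
result = run_when_ready(ops, {"a": 2, "b": 5})
print(result)

# With "b" missing, no operation has all its needs, so nothing runs.
partial = run_when_ready(ops, {"a": 2})
print(partial)
```

The listing order of the operations does not matter in this sketch: `sub` simply waits until `mul` has populated `ab`.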

Graphtik sprang from Graphkit (summer 2019, v1.2.2) to experiment with Python 3.6+ features, but has diverged significantly with enhancements ever since.


Features

Can assemble existing functions without modifications into pipelines.
Dependency resolution can bypass calculation cycles based on the data given and asked for.
Support functions with optional input args and/or varargs.
Support functions with partial outputs; keep working even if certain endured operations fail.
Support aliases of function provides to avoid the need for trivial conveyor operations.
Default conveyor operation to easily pass (possibly nested) dependencies around.
Merge or nest sub-pipelines.
Hierarchical dependencies may access data values deep in the solution with JSON Pointer path expressions.
Hierarchical dependencies annotated as implicit imply which subdoc dependency the function reads or writes in the parent-doc.
Denote and schedule sideffects on dependency values, to update them repeatedly, avoiding cycles (e.g. to add columns into pandas.DataFrames).
Deterministic pre-decided execution plan (excepting partial-outputs or endured operations).
Early eviction of intermediate results from solution, to optimize memory footprint.
Solution tracks all intermediate overwritten values for the same dependency.
Parallel execution (but underdeveloped).
Elaborate plotting with configurable plot themes.
Integration with Sphinx sites with the new graphtik directive.
Authored with debugging in mind.
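
To make the path-based access to nested values in the feature list more concrete, here is a minimal sketch of following a slash-separated path into nested data. It uses plain Python only: `resolve_path` is a hypothetical helper, not part of Graphtik, and it skips the `~0`/`~1` escapes that full JSON Pointer defines, so Graphtik's own handling may differ.

```python
def resolve_path(doc, path):
    """Follow a slash-separated path such as "/a/b/0" into nested dicts/lists."""
    node = doc
    for step in path.lstrip("/").split("/"):
        if step == "":
            continue  # an empty path refers to the whole document
        if isinstance(node, list):
            node = node[int(step)]  # list steps are integer indices
        else:
            node = node[step]
    return node


# A made-up nested "solution" for illustration.
nested = {"inputs": {"a": 2, "b": 5}, "results": {"powers": [10, -8, 512]}}
a_value = resolve_path(nested, "/inputs/a")
last_power = resolve_path(nested, "/results/powers/2")
print(a_value, last_power)  # prints: 2 512
```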

Anti-features

It’s not an orchestrator for long-running tasks, nor a calendar scheduler; Apache Airflow and Luigi may help with that.
It’s not really a parallelizing optimizer, nor a map-reduce framework; look additionally at Dask, IpyParallel, Celery, Hive, Pig, Spark, Hadoop, etc.

Quick start

Here’s how to install:

pip install graphtik

OR with dependencies for plotting support (you also need to install the Graphviz program separately with your OS tools):

pip install graphtik[plot]
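
Because Graphviz is installed separately from the Python package, a quick sanity check is to look for its executable on the PATH. The sketch below uses only the standard library; that plotting relies on the usual Graphviz `dot` command is an assumption made here, not something stated above.

```python
import shutil

# Assumption: plotting shells out to Graphviz's usual `dot` executable.
dot_path = shutil.which("dot")
if dot_path is None:
    print("Graphviz 'dot' not found on PATH; install it with your OS tools.")
else:
    print(f"Graphviz 'dot' found at {dot_path}")
```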

Let’s build a Graphtik computation pipeline that produces 3 outputs out of 2 inputs a and b:

\[ \begin{align}\begin{aligned}a \times b\\a - a \times b\\|a - a \times b| ^ 3\end{aligned}\end{align} \]
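
As a worked instance of these three formulas, take a = 2 and b = 5 (values chosen here for illustration). In plain Python, before any pipeline is involved, they evaluate as follows; the variable names are just labels picked for this sketch.

```python
from operator import mul, sub

a, b = 2, 5
ab = mul(a, b)                               # a × b
a_minus_ab = sub(a, ab)                      # a − a × b
abs_a_minus_ab_cubed = abs(a_minus_ab) ** 3  # |a − a × b|³
print(ab, a_minus_ab, abs_a_minus_ab_cubed)  # prints: 10 -8 512
```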

>>> from graphtik import compose, operation
>>> from operator import mul, sub

>>> @operation(name="abs qubed",
...            needs=["a_minus_ab"],
...            provides=["abs_a_minus_ab_cubed"])
... def abs_qubed(a):
...    return abs(a) ** 3

Compose the abs_qubed function along with the mul & sub built-ins into a computation graph:

>>> graphop = compose("graphop",
...    operation(mul, needs=["a", "b"], provides=["ab"]),
...    operation(sub, needs=["a", "ab"], provides=["a_minus_ab"]),
...    abs_qubed,
... )
>>> graphop
Pipeline('graphop', needs=['a', 'b', 'ab', 'a_minus_ab'],
                  provides=['ab', 'a_minus_ab', 'abs_a_minus_ab_cubed'],
                  x3 ops: mul, sub, abs qubed)

You may plot the function graph in a file like this (if in Jupyter, there is no need to specify the file, see Jupyter notebooks):

>>> graphop.plot('graphop.svg')      # doctest: +SKIP

As you can see, any function can be used as an operation in Graphtik, even ones imported from system modules.

Run the graph-operation and request all of the outputs:

>>> sol = graphop(**{'a': 2, 'b': 5})
>>> sol
{'a': 2, 'b': 5, 'ab': 10, 'a_minus_ab': -8, 'abs_a_minus_ab_cubed': 512}

Solutions are plottable as well:

>>> sol.plot('solution.svg')      # doctest: +SKIP

Run the graph-operation and request a subset of the outputs:

>>> solution = graphop.compute({'a': 2, 'b': 5}, outputs=["a_minus_ab"])
>>> solution
{'a_minus_ab': -8}
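
Asking for a subset of the outputs means some operations never need to run. The sketch below illustrates that kind of pruning by walking backwards from the requested outputs. It is plain Python and not Graphtik's planner: `needed_ops` and the dict layout are hypothetical, and unlike the real library it ignores which inputs were actually given.

```python
def needed_ops(ops, outputs):
    """Return the names of ops required to produce `outputs`, in listing order.

    `ops` maps an op name to (needs, provides); a made-up layout for this sketch.
    """
    producers = {p: name for name, (_, provides) in ops.items() for p in provides}
    wanted, keep = list(outputs), set()
    while wanted:
        dep = wanted.pop()
        name = producers.get(dep)  # None means `dep` is a plain input
        if name is not None and name not in keep:
            keep.add(name)
            wanted.extend(ops[name][0])  # now the op's own needs are wanted
    return [name for name in ops if name in keep]


ops = {
    "mul": (["a", "b"], ["ab"]),
    "sub": (["a", "ab"], ["a_minus_ab"]),
    "abs qubed": (["a_minus_ab"], ["abs_a_minus_ab_cubed"]),
}
pruned = needed_ops(ops, ["a_minus_ab"])
print(pruned)  # prints: ['mul', 'sub']
```

Here the "abs qubed" operation is dropped because nothing requested depends on its output.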

The symbols used in these plots are explained by the (interactive) legend:

>>> from graphtik.plot import legend
>>> l = legend()

[Figure: legend]