Graphtik

Version: 10.1.0 (git: v10.1.0, Aug 31, 2020). License: Apache License, version 2.0.

It’s a DAG all the way down!

[Figure: example solution graph. The input quarantine feeds the operation get_out_or_stay_home, which provides space and time; space feeds exercise (providing fun and body), and time feeds read_book (providing fun and brain).]

Lightweight computation graphs for Python

Graphtik is a library to design, plot & execute graphs of functions (a.k.a. pipelines) that consume and populate (possibly nested) data, based on whether values for those data (a.k.a. dependencies) exist.

  • The API posits a fair compromise between features and complexity, without precluding any.

  • It can be used as is to build machine learning pipelines for data science projects.

  • It should be extendable to act as the core for a custom ETL engine, a workflow-processor for interdependent tasks & files like GNU Make, or a spreadsheet calculation engine.

Graphtik sprang from Graphkit (summer 2019, v1.2.2) to experiment with Python 3.6+ features, but has diverged significantly with enhancements ever since.

Features

  • Can assemble existing functions without modifications into pipelines.

  • Dependency resolution can bypass calculation cycles based on data given and asked.

  • Support functions with optional input args and/or varargs.

  • Support functions with partial outputs; keep working even if certain endured operations fail.

  • Support aliases for a function's provides, to avoid the need for trivial conveyor operations.

  • Default conveyor operation to easily pass (possibly nested) dependencies around.

  • Merge or nest sub-pipelines.

  • Hierarchical dependencies may access data values deep in solution with json pointer path expressions.

  • Hierarchical dependencies annotated as implicit imply which subdoc dependency the function reads or writes in the parent-doc.

  • Denote and schedule sideffects on dependency values, to update them repeatedly, avoiding cycles (e.g. to add columns into pandas.DataFrames).

  • Deterministic pre-decided execution plan (excepting partial-outputs or endured operations).

  • Early eviction of intermediate results from solution, to optimize memory footprint.

  • Solution tracks all intermediate overwritten values for the same dependency.

  • Parallel execution (but underdeveloped).

  • Elaborate plotting with configurable plot themes.

  • Integration with Sphinx sites with the new graphtik directive.

  • Authored with debugging in mind.
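
To make the idea of data-driven execution in the list above more concrete, here is a minimal sketch in plain Python. It is not Graphtik's implementation and uses none of its API: the `run` helper, the `(function, needs, provides)` tuples and the geometry example are all hypothetical, chosen only to show how operations can run or be skipped depending on which values exist.

```python
# Hypothetical sketch, NOT Graphtik code: run each function as soon as
# all the values it needs are available, and skip the rest.

# Each operation is (function, needs, provides); names are illustrative only.
OPERATIONS = [
    (lambda area, depth: area * depth, ["area", "depth"], "volume"),
    (lambda width, height: width * height, ["width", "height"], "area"),
    (lambda width, height: 2 * (width + height), ["width", "height"], "perimeter"),
]


def run(operations, inputs):
    """Return (solution, skipped) for the given input values."""
    solution = dict(inputs)
    # An operation whose output was supplied by the caller is not needed.
    pending = [op for op in operations if op[2] not in solution]
    progressed = True
    while pending and progressed:
        progressed = False
        for op in list(pending):
            fn, needs, provides = op
            if all(name in solution for name in needs):
                solution[provides] = fn(*(solution[name] for name in needs))
                pending.remove(op)
                progressed = True
    # Whatever is still pending lacked some input and never ran.
    skipped = [op[2] for op in pending]
    return solution, skipped


# "depth" is missing, so "volume" cannot be computed.
solution_1, skipped_1 = run(OPERATIONS, {"width": 2, "height": 3})

# "area" is given directly, so the operation computing it is bypassed.
solution_2, skipped_2 = run(OPERATIONS, {"area": 6, "depth": 2})
```

In the first call only `area` and `perimeter` are computed; in the second only `volume` is. A real library has to deal with much more (multiple outputs, optional inputs, failures), which this sketch deliberately ignores.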

Anti-features

  • It’s not an orchestrator for long-running tasks, nor a calendar scheduler; Apache Airflow and Luigi may help with that.

  • It’s not really a parallelizing optimizer, nor a map-reduce framework; see also Dask, IpyParallel, Celery, Hive, Pig, Spark, Hadoop, etc.

Quick start

Here’s how to install:

pip install graphtik

Or, with the dependencies for plotting support (you also need to install the Graphviz program separately, with your OS tools):

pip install graphtik[plot]

Let’s build a Graphtik computation pipeline that produces 3 outputs from 2 inputs, a and b:

\[
\begin{aligned}
& a \times b \\
& a - a \times b \\
& |a - a \times b|^3
\end{aligned}
\]

>>> from graphtik import compose, operation
>>> from operator import mul, sub
>>> @operation(name="abs qubed",
...            needs=["a_minus_ab"],
...            provides=["abs_a_minus_ab_cubed"])
... def abs_qubed(a):
...    return abs(a) ** 3

Compose the abs_qubed function along with the mul & sub built-ins into a computation graph:

>>> graphop = compose("graphop",
...    operation(mul, needs=["a", "b"], provides=["ab"]),
...    operation(sub, needs=["a", "ab"], provides=["a_minus_ab"]),
...    abs_qubed,
... )
>>> graphop
Pipeline('graphop', needs=['a', 'b', 'ab', 'a_minus_ab'],
                  provides=['ab', 'a_minus_ab', 'abs_a_minus_ab_cubed'],
                  x3 ops: mul, sub, abs qubed)

You may plot the function graph to a file like this (in Jupyter there is no need to specify the file, see Jupyter notebooks):

>>> graphop.plot('graphop.svg')      # doctest: +SKIP

As you can see, any function can be used as an operation in Graphtik, even ones imported from system modules.

Run the graph-operation and request all of the outputs:

>>> sol = graphop(**{'a': 2, 'b': 5})
>>> sol
{'a': 2, 'b': 5, 'ab': 10, 'a_minus_ab': -8, 'abs_a_minus_ab_cubed': 512}
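
As a cross-check of the values above, the same three formulas can be evaluated by hand in plain Python, without Graphtik. The variable names simply mirror the dependency names used in the pipeline; the `expected` dict is only an illustration.

```python
# Plain-Python check of the three formulas for a=2, b=5 (no Graphtik involved).
a, b = 2, 5

ab = a * b                                   # a × b
a_minus_ab = a - ab                          # a − a × b
abs_a_minus_ab_cubed = abs(a_minus_ab) ** 3  # |a − a × b| ^ 3

expected = {
    "a": a,
    "b": b,
    "ab": ab,
    "a_minus_ab": a_minus_ab,
    "abs_a_minus_ab_cubed": abs_a_minus_ab_cubed,
}
print(expected)
```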

Solutions are plottable as well:

>>> sol.plot('solution.svg')      # doctest: +SKIP

Run the graph-operation and request a subset of the outputs:

>>> solution = graphop.compute({'a': 2, 'b': 5}, outputs=["a_minus_ab"])
>>> solution
{'a_minus_ab': -8}
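
When only a subset of outputs is requested, operations that do not contribute to them need not run. The following is a rough sketch of that idea in plain Python, walking backwards from the requested outputs to the given inputs. It is an assumption-laden illustration, not Graphtik's planner: the `needed_operations` helper and the `(name, needs, provides)` tuples are hypothetical.

```python
# Hypothetical sketch, NOT Graphtik's planner: find which operations must
# run to produce the requested outputs from the given inputs.

# (name, needs, provides), mirroring the pipeline composed above.
OPS = [
    ("mul", ["a", "b"], ["ab"]),
    ("sub", ["a", "ab"], ["a_minus_ab"]),
    ("abs qubed", ["a_minus_ab"], ["abs_a_minus_ab_cubed"]),
]


def needed_operations(ops, inputs, outputs):
    """Return operation names in a runnable order, pruning unneeded ones."""
    producer = {out: op for op in ops for out in op[2]}
    needed = []
    visited = set()

    def visit(name):
        # Given values need no producer; visited names are already handled.
        if name in inputs or name in visited:
            return
        visited.add(name)
        if name not in producer:
            raise KeyError(f"nothing provides {name!r}")
        op_name, needs, _ = producer[name]
        for need in needs:
            visit(need)  # producers of the needs must run first
        if op_name not in needed:
            needed.append(op_name)

    for out in outputs:
        visit(out)
    return needed


subset_plan = needed_operations(OPS, {"a", "b"}, ["a_minus_ab"])
full_plan = needed_operations(OPS, {"a", "b"}, ["abs_a_minus_ab_cubed"])
shortcut_plan = needed_operations(OPS, {"a", "ab"}, ["a_minus_ab"])
```

Here `subset_plan` omits the "abs qubed" operation, and `shortcut_plan` also omits "mul" because `ab` is already given.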

The (interactive) legend for the plots is this:

>>> from graphtik.plot import legend
>>> l = legend()

[Figure: the plot legend]