Graphtik
(src: 10.2.1, git: v10.2.1, Sep 18, 2020)
It’s a DAG all the way down!
Lightweight computation graphs for Python
Graphtik is a library to compose, plot & execute graphs of Python functions (a.k.a. pipelines) that consume and populate named data (a.k.a. dependencies), possibly nested. Which functions execute depends on whether values for their dependencies exist in the inputs or have been calculated earlier. It was designed with pandas in mind.
Its primary use case is building flexible algorithms for data science/machine learning projects.
It should be extendable to implement the following:
an IoC dependency resolver (e.g. Java Spring);
an executor of interdependent tasks based on files (e.g. GNU Make);
a custom ETL engine;
a spreadsheet calculation engine.
Graphtik sprang from Graphkit (summer 2019, v1.2.2) to experiment with Python 3.6+ features, but has diverged significantly with enhancements ever since.
Table of Contents
- 1. Operations
- 2. Pipelines
- 3. Plotting and Debugging
- 4. Architecture
- 5. API Reference
- 6. Changes
- GitHub Releases
- Changelog
- v10.2.1 (18 Sep 2020, @ankostis): plot sol bugfix
- v10.2.0 (16 Sep 2020, @ankostis): RECOMPUTE, pre-callback, drop op_xxx, ops-eq-op.name, drop NULL_OP
- v10.1.0 (5 Aug 2020, @ankostis): rename return-dict outs; step number badges
- v9.2.0 (4 Jul 2020, @ankostis): Drop MultiValueError
- v9.1.0 (4 Jul 2020, @ankostis): Bugfix, panda-polite, privatize modifier fields
- v9.0.0 (30 Jun 2020, @ankostis): JSONP; net, evictions & sfxed fixes; conveyor fn; rename modules
- v8.4.0 (15 May 2020, @ankostis): subclass-able Op, plot edges from south–>north of nodes
- v8.3.1 (14 May 2020, @ankostis): plot edges from south–>north of nodes
- v8.3.0 (12 May 2020, @ankostis): mapped–>keyword, drop sol-finalize
- v8.2.0 (11 May 2020, @ankostis): custom Solutions, Task-context
- v8.1.0 (11 May 2020, @ankostis): drop last plan, Rename/Nest, Netop–>Pipeline, purify modules
- v8.0.2 (7 May 2020, @ankostis): re-MODULE; sideffect –> sfx; all DIACRITIC Modifiers; invert “merge” meaning
- v8.0.0, v8.0.1 (7 May 2020, @ankostis): retracted bc found more restructurings
- v7.1.1 (5 May 2020, @ankostis): canceled, by mistake contained features for 8.x
- NET: fix rescheduled, cancelable sfx, improve compute API
- MODIFIERS: modifier combinations, rename sol_sideffects
- PLOT: them-ize all, convey user-attrs, draw nest clusters, click SVGs to open in tab, …
- Various: raise TypeErrors, improve “operations” section
- MODIFIERS: Sideffecteds; arg–> mapped
- PLOT: Badges, StyleStacks, refact Themes, fix style mis-classifications, don’t plot steps
- Sphinx extension:
- Configurations:
- DOC:
- v6.2.0 (19 Apr 2020, @ankostis): plotting fixes & more styles, net find util methods
- v6.1.0 (14 Apr 2020, @ankostis): config plugs & fix styles
- v6.0.0 (13 Apr 2020, @ankostis): New Plotting Device…
- v5.7.1 (7 Apr 2020, @ankostis): Plot job, fix RTD deps
- v5.7.0 (6 Apr 2020, @ankostis): FIX +SphinxExt in Wheel
- v5.6.0 (6 Apr 2020, @ankostis, BROKEN): +check_if_incomplete
- v5.5.0 (1 Apr 2020, @ankostis, BROKEN): ortho plots
- v5.4.0 (29 Mar 2020, @ankostis, BROKEN): auto-name ops, dogfood quickstart
- v5.3.0 (28 Mar 2020, @ankostis, BROKEN): Sphinx plots, fail-early on bad op
- v5.2.2 (03 Mar 2020, @ankostis): stuck in PARALLEL, fix Impossible Outs, plot quoting, legend node
- v5.2.1 (28 Feb 2020, @ankostis): fix plan cache on skip-evictions, PY3.8 TCs, docs
- v5.2.0 (27 Feb 2020, @ankostis): Map needs inputs –> args, SPELLCHECK
- v5.1.0 (22 Jan 2020, @ankostis): accept named-tuples/objects provides
- v5.0.0 (31 Dec 2019, @ankostis): Method–>Parallel, all configs now per op flags; Screaming Solutions on fails/partials
- v4.4.1 (22 Dec 2019, @ankostis): bugfix debug print
- v4.4.0 (21 Dec 2019, @ankostis): RESCHEDULE for PARTIAL Outputs, on a per op basis
- Details
- v4.3.0 (16 Dec 2019, @ankostis): Aliases
- v4.2.0 (16 Dec 2019, @ankostis): ENDURED Execution
- v4.1.0 (13 Dec 2019, @ankostis): ChainMap Solution for Rewrites, stable TOPOLOGICAL sort
- v4.0.1 (12 Dec 2019, @ankostis): bugfix
- v4.0.0 (11 Dec 2019, @ankostis): NESTED merge, revert v3.x Unvarying, immutable OPs, “color” nodes
- v3.1.0 (6 Dec 2019, @ankostis): cooler prune()
- v3.0.0 (2 Dec 2019, @ankostis): UNVARYING NetOperations, narrowed, API refact
- v2.3.0 (24 Nov 2019, @ankostis): Zoomable SVGs & more op jobs
- v2.2.0 (20 Nov 2019, @ankostis): enhance OPERATIONS & restruct their modules
- v2.1.1 (12 Nov 2019, @ankostis): global configs
- v2.1.0 (20 Oct 2019, @ankostis): DROP BW-compatible, Restruct modules/API, Plan perfect evictions
- v2.0.0b1 (15 Oct 2019, @ankostis): Rebranded as Graphtik for Python 3.6+
- Network
- Testing & other code:
- Network:
- Plotting:
- Testing & other code:
- Chore & Docs:
- v1.2.4 (Mar 7, 2018)
- 1.2.2 (Mar 7, 2018, @huyng): Fixed versioning
- 1.2.1 (Feb 23, 2018, @huyng): Fixed multi-threading bug and faster compute through caching of find_necessary_steps
- 1.2.0 (Feb 13, 2018, @huyng)
- 1.1.0 (Nov 9, 2017, @huyng)
- 1.0.4 (Nov 3, 2017, @huyng): Networkx 2.0 compatibility
- 1.0.3 (Jan 31, 2017, @huyng): Make plotting dependencies optional
- 1.0.2 (Sep 29, 2016, @pumpikano): Merge pull request yahoo#5 from yahoo/remove-packaging-dep
- 1.0.1 (Aug 24, 2016)
- 1.0 (Aug 2, 2016, @robwhess)
- 7. Index
Features
Deterministic pre-decided execution plan (excepting partial-outputs or endured operations, see below).
Can assemble existing functions without modifications into pipelines.
Dependency resolution can bypass calculation cycles based on the data given and asked.
Support functions with partial outputs; keep working even if certain endured operations fail.
Facilitate trivial conveyor operations and alias on provides.
Support cycles, by annotating repeated updates of dependency values as sideffects (e.g. to add columns into pandas.DataFrames).
Hierarchical dependencies may access data values deep in solution with json pointer path expressions.
Hierarchical dependencies annotated as implicit imply which subdoc dependency the function reads or writes in the parent-doc.
Early eviction of intermediate results from solution, to optimize memory footprint.
Solution tracks all intermediate overwritten values for the same dependency.
Elaborate Graphviz plotting with configurable plot themes.
Integration with Sphinx sites with the new graphtik directive.
Authored with debugging in mind.
Parallel execution (but underdeveloped & deprecated).
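To make the "json pointer path expressions" feature above more concrete, here is a minimal plain-Python sketch of how a path can reach a value deep inside a nested solution. The `resolve_pointer` helper and the sample data are hypothetical illustrations in the style of RFC 6901, not Graphtik's own API or implementation.

```python
def resolve_pointer(doc, pointer):
    """Follow a JSON-pointer-like path through nested dicts and lists.

    Hypothetical helper for illustration only; not part of Graphtik.
    """
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"pointer must start with '/': {pointer!r}")
    value = doc
    for raw_step in pointer.split("/")[1:]:
        # Unescape in RFC 6901 order: "~1" -> "/" first, then "~0" -> "~".
        step = raw_step.replace("~1", "/").replace("~0", "~")
        if isinstance(value, list):
            value = value[int(step)]  # list items are addressed by index
        else:
            value = value[step]  # dict items are addressed by key
    return value


# Made-up nested "solution" holding a sub-document.
solution = {"stats": {"mean": 2.5, "samples": [1, 2, 3, 4]}}
mean = resolve_pointer(solution, "/stats/mean")
third = resolve_pointer(solution, "/stats/samples/2")
```

A function that only needs the mean could then declare a dependency on the nested path, rather than receive the whole parent document.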
Anti-features
It’s not an orchestrator for long-running tasks, nor a calendar scheduler - Apache Airflow, Dagster or Luigi may help for that.
It’s not really a parallelizing optimizer, nor a map-reduce framework - look additionally at Dask, IpyParallel, Celery, Hive, Pig, Spark, Hadoop, etc.
Quick start
Here’s how to install:
pip install graphtik
OR with the dependencies for plotting support (the Graphviz program must be installed separately, with your OS tools):
pip install graphtik[plot]
Let’s build a graphtik computation pipeline that produces the following x3 outputs out of x2 inputs (α and β):
>>> from graphtik import compose, operation
>>> from operator import mul, sub
>>> @operation(name="abs qubed",
...            needs=["α-α×β"],
...            provides=["|α-α×β|³"])
... def abs_qubed(a):
...     return abs(a) ** 3
Hint
Notice that graphtik has no problem working with Unicode characters in dependency names.
Compose the abs_qubed function along with the mul & sub built-ins into a computation graph:
>>> graphop = compose("graphop",
...     operation(mul, needs=["α", "β"], provides=["α×β"]),
...     operation(sub, needs=["α", "α×β"], provides=["α-α×β"]),
...     abs_qubed,
... )
>>> graphop
Pipeline('graphop', needs=['α', 'β', 'α×β', 'α-α×β'],
         provides=['α×β', 'α-α×β', '|α-α×β|³'],
         x3 ops: mul, sub, abs qubed)
You may plot the function graph in a file like this (if in jupyter, no need to specify the file, see Jupyter notebooks):
>>> graphop.plot('graphop.svg') # doctest: +SKIP
As you can see, any function can be used as an operation in Graphtik, even ones imported from system modules.
Run the graph-operation and request all of the outputs:
>>> sol = graphop(**{'α': 2, 'β': 5})
>>> sol
{'α': 2, 'β': 5, 'α×β': 10, 'α-α×β': -8, '|α-α×β|³': 512}
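The values in this solution can be checked by hand. The following plain-Python sketch repeats the same three steps without Graphtik. The variable names are chosen here for illustration only, and the dictionary merely mirrors the shape of the solution printed above.

```python
from operator import mul, sub

a, b = 2, 5                    # the two inputs, α and β
product = mul(a, b)            # α×β
difference = sub(a, product)   # α-α×β
cubed = abs(difference) ** 3   # |α-α×β|³

# Same keys as the pipeline's dependency names.
expected = {
    "α": a,
    "β": b,
    "α×β": product,
    "α-α×β": difference,
    "|α-α×β|³": cubed,
}
```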
Solutions are plottable as well:
>>> sol.plot('solution.svg') # doctest: +SKIP
Run the graph-operation and request a subset of the outputs:
>>> solution = graphop.compute({'α': 2, 'β': 5}, outputs=["α-α×β"])
>>> solution
{'α-α×β': -8}
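Requesting a subset of outputs means only some operations have to run. As a rough sketch of that idea, the plain-Python code below walks backwards from the asked outputs to the given inputs and selects the operations in between. The `OPS` table and the `needed_ops` helper are hypothetical and only mimic the quick-start pipeline. Graphtik's actual planning is more elaborate.

```python
# Each operation is (name, needs, provides), mirroring the quick-start pipeline.
OPS = [
    ("mul", ["α", "β"], ["α×β"]),
    ("sub", ["α", "α×β"], ["α-α×β"]),
    ("abs qubed", ["α-α×β"], ["|α-α×β|³"]),
]


def needed_ops(ops, inputs, outputs):
    """Return the names of operations required to get `outputs` from `inputs`."""
    # Map every dependency to the operation that provides it.
    producers = {out: op for op in ops for out in op[2]}
    wanted, selected = list(outputs), set()
    while wanted:
        dep = wanted.pop()
        if dep in inputs:
            continue  # value is given, so nothing has to compute it
        if dep not in producers:
            raise KeyError(f"no operation provides {dep!r}")
        name, needs, _ = producers[dep]
        if name not in selected:
            selected.add(name)
            wanted.extend(needs)  # keep walking back towards the inputs
    # Report in the original (already topological) declaration order.
    return [name for name, _, _ in ops if name in selected]


plan = needed_ops(OPS, inputs={"α", "β"}, outputs=["α-α×β"])
```

In this sketch the "abs qubed" step is left out, since nothing asked for its output. If the intermediate value α×β were also given as an input, the "mul" step would be skipped too.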
The (interactive) legend for the plots is this:
>>> from graphtik.plot import legend
>>> l = legend()