This post aims to show off the ease of importing from external data into Zef and in a future post we'll use a proxy view to expose Zef to 3rd-party packages.
1) Get some external data
We'll use the Northwind dataset as an example, which describes sales and orders for a company. This is available from here https://code.google.com/archive/p/northwindextended/downloads. To convert this to CSV files, I wrote a little script available
export.py, which creates a temporary SQLite DB to export each table as its own CSV file. If you'd like to follow along as home, to save you the bother I've made these CSV exports available
After running this script, we find there are 14 CSV files in this dataset. I'll use
products.csv to demonstrate some of the features below and its first few rows look like:
|1||Chai||8||1||10 boxes x 30 bags||18||39||0||10||1|
|2||Chang||1||1||24 - 12 oz bottles||19||17||40||25||1|
|3||Aniseed Syrup||1||2||12 - 550 ml bottles||10||13||70||25||0|
So long as the CSV files can be imported using the
pandas python module, then we are able to import these into a Zef graph.
2) Import into Zef
If you have seen the "Import from CSV" how-to in our docs, you might first think to jump straight to the Zef
pandas_to_gd op to import it. However, the CSV files in this dataset are in a representation that is best suited for SQL, with columns representing both fields AND relationships between different entities.
Instead, we need to provide a declaration for the set of tables that the CSV files represent, which all together produce the right graph structure. For example, we should be able to specify the purpose of each column in the
products.csv table to be something like:
The above is just a layout for me to organise my thoughts on what each column should do. I could have instead written the above in natural language:
ProductIDcolumn should represent the ID of the
which this table defines in each row. The IDs will be stored on the Zef graph
ProductNamecolumn gives fields for each row, which will have a relation
RT.ProductNameand be a string.
SupplierIDcolumn is a different entity of type
identified by the integer in this column. It is linked to the
We will need to introduce a couple of more "purposes" in a moment, but otherwise this has nearly covered all of the main uses of the dataset.
Writing this all out by hand is tedious, so I wrote up a quick parser in Zef to produce the initial layout for you and then allow you to edit it. I made the following by running:
from zef.experimental import sql_import decl = sql_import.guess_csvs("products.csv") decl | write_file["guess.yaml"] | run
then editing the file
guess.yaml a little bit:
default_ID: ID definitions: - tag: products data_source: filename: products.csv type: csv kind: entity ID_col: ProductID cols: - name: ProductID purpose: id RT: ID data_type: Int - name: ProductName purpose: field RT: ProductName data_type: String - name: SupplierID purpose: entity ET: Supplier RT: SuppliedBy data_type: Int
I then wrote up a function
import_actions which will take this declaration of how to map the CSV data to the graph and do the busy-work to product a graph. Here is how to run that:
decl = "edited.yaml" | load_file | run | get["content"] | collect g = Graph() actions = sql_import.import_actions(decl) actions | transact[g] | run
import_actions function will use the information in
decl to find the files from which to read the raw data.
While the above works okay on the
products.csv table, I needed to include two more things to allow the import of all CSV files simulataneously: a way to tag a table as a "entity" or "relation" style and a source/target purpose.
Editing the yaml file by hand is clunky. However, it does have the benefit of being easy to a) read as plain text, and b) save the declaration of the import without custom data structures.
To get rid of the clunkiness, and also indulge myself to explore a new package, I decided to write a little UI using
pyimgui to better edit the file. Try it out on my sample below.
python -m zef.experimental.sql_ui.wizard sql_import.yaml
You should see something like:
The UI is in its early stages so might not work fully. It also requires installing the
pyimgui module with sdl support (i.e.
pip3 install pyimgui[sdl2] or the like). Weirdly this seems to have problems with python version 3.10 on MacOS... we will be looking into this.
To read more about using this GUI, check out our docs page on Multiple interlinked CSVs
"relation" kinds of tables
If there is a one-to-many or many-to-many relationship between two objects in a SQL database, then there are various ways to represent this. In the Northwind example, the "Order Details" table demonstrates this, as it allows each Order to contain multiple different products (the one-to-many relation), with particular
(order,product)-specific prices, quantities and discounts.
Hence we can think of this table as not being entity-centric but rather relation-centric. Any fields (e.g. the UnitPrice column) is information that should be attached to the relation between the order and the product. As Zef graphs we always have directed relations, so we need to also specify a "source" and "target" column:
My full processing of the CSV files is shown below. My edits to the
sql_import.yaml file are available below.
from zef import * from zef.ops import * from zef.experimental import sql_import decl = sql_import.guess_csvs("*.csv") decl | write_file["sql_import.yaml"] | run # ... edit sql_import.yaml externally... decl = "sql_import.yaml" | load_file | run | get["content"] | collect g = Graph() actions = sql_import.import_actions(decl) actions | transact[g] | run
After this import, the graph looks like:
>>> yo(g) <...snip...> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Atomic Entities ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [6007 total, 6007 alive] AET.String [2985 total, 2985 alive] AET.Float [2527 total, 2527 alive] AET.Int [2232 total, 2232 alive] AET.QuantityFloat.dollars [2469 total, 2469 alive] AET.Time ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Entities ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [1660 total, 1660 alive] ET.Order [154 total, 154 alive] ET.Product [12 total, 12 alive] ET.Employee [93 total, 93 alive] ET.Customer [53 total, 53 alive] ET.Territory [4 total, 4 alive] ET.Region [3 total, 3 alive] ET.Shipper [29 total, 29 alive] ET.Supplier [8 total, 8 alive] ET.Category ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Relations ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [4 total, 4 alive] RT.RegionDescription [4 total, 4 alive] (ET.Region, RT.RegionDescription, AET.String) [2155 total, 2155 alive] RT.Order [2155 total, 2155 alive] (ET.Order, RT.Order, ET.Product) [9 total, 9 alive] RT.ReportsTo [9 total, 9 alive] (ET.Employee, RT.ReportsTo, ET.Employee) <...snip...>
If you don't want to do the import yourself, you can also look at the graph that I created at the tag
zef/blog/northwind, that is, you can access it via:
g = Graph("zef/blog/northwind")
As always, answering a simple question has opened up many more questions which I can't help myself but discuss below...
One final purpose: "field_on"
There's another addition that's needed to import many production SQL tables: a way to handle databases with optimised tables.
While we do not need this kind of purpose for the Northwind dataset, it is useful to mention it to "complete the story". The above example of the "Order Details" table which produced pure relations (and not entities) is because the Northwind dataset is in "first normal form".
A "denormalised" dataset could also be used for performance reasons. For example, obtaining all of the products ordered by a particular customer would require an SQL JOIN query to obtain:
SELECT DISTINCT(order_details.ProductID) FROM order_details INNER JOIN orders ON order_details.OrderID = orders.OrderID WHERE orders.CustomerID = "VINET";
Instead, the join of orders and order_details could itself be stored as a table (or view) in the SQL database:
The concept of avoiding joins is rather weird when coming from a graph perspective, as the equivalent of joins are trivial on a graph. However, you may not have any choice in the data you want to import. In this case, we can mark those columns as belonging to the relation between the order and product using "field_on":
Here the "Quantity" column cannot be set as a
field as it would have multiple values for the same
ET.Order. So instead we designate that it should be attached to the "target" column
ProductID. This means we could write a Zef query for the total quantity of an order as:
z_order | out_rels[RT.Product] | Out[RT.Quantity] | value | sum | collect
The imported tables still duplicate a lot of information. For example, each of the
ET.Employee have a
RT.Region which is a scalar string. This is not very connected, as it would be better to have these
RT.Regions point at a
ET.Region. That way, we could ask for all suppliers in the same region as a customer without doing string matching.
I was tempted to add this in as another purpose, something like
entity_from_matching_field. But this would have degenerated into providing an arbitrary language to describe the endless possible databases out there. But we already have a fully-featured language to manipulate this: Zef! So we can just post-process the imported Zef graph.
You might be worried about exposing the gorey details of the import process and only the post-processed data. This is easy, if we export only
g | now to a new graph after performing the post-processing.
A comment on speed
If you run the commands in this blog post on a large dataset, then you are going to be waiting several minutes for the import. Even as part of writing this post up, and using the Northwind dataset which is relatively small, I found I had to optimise some aspects of the
This is largely due to the current pure-python implementation of the evaluation engine for ZefOps. In the future this will be implemented in C++ along with the core of the GraphDelta code.
Import directly from SQL
To be honest this blog post is a little crude, using CSV files which lack type information rather than the SQL source directly. The information available in the SQL schema can also assist automatic detection of connected entities and whether a table represents a single entity or many-to-many relations.
The benefit of handling CSV files is it allows us to accomplish almost all kinds of imports, so long as we provide enough additional information. If we supported only SQL, then this would limit the flexibility.
The other reason that the SQL schema is not used directly, is that it's a fair bit more work to support a SQL connection or export. But our intent is to extend these tools and make imports as close to one-click as possible!
Analysis using external tools
I'll leave this for the next blog post! But as a sneak preview...
If you'd like to find out more about Zef and ZefHub (and get early access), get us out at zefhub.io.