Identifying cloned academics

Identifying cloned academics

Zef is great for representing, storing and retrieving data. It also provides an easy way to interface with other packages, such as NetworkX or scikit learn for analysis.

NetworkX is a convenient python package that exposes algorithms grouped into over 60 topics, including cluster detection, flow analysis, shortest-path calculation, etc... Hence, if we expose a proxy (otherwise known as a interface or wrapper) that mimics a NetworkX graph, but actually queries a Zef graph, then we can take advantage of all of these algorithms. That's what zef.experimental.networkx is for!

You can find some very early documentation on our docs page. This post will demo it on some real world data.

Example Project: Name Disambiguation

We look at a deceptively easy-looking problem of name disambiguation in academic journals. "Name disambiguation" can be more simply stated as:

Find authors with the same name who aren't the same person

For example, "John Smith" is a common English name, so it is natural to expect
multiple different people to publish under that name. Using just a person's name
to identify them is then obviously a problem, but this is often all that is available.

Without being able to distinguish two different John Smiths from one another, we have issues with questions like:

How many co-authors does "John Smith" publish with?

Without disambiguating various different John Smiths, the answer could be
tens of thousands, even though the average number of true co-authors for an individual
John Smith might be less than 100.

Disclaimer

True name disambiguation is a difficult problem and I am no expert on this. The analysis below is a bit of fun rather than any serious attempt to solve the name disambiguation problem. 

First off we need a bunch of academic publishing data. Instead of looking at journals, we will use the freely available open-access data of the arXiv pre-print database.

This is a fun exercise in itself, which I may post about later. For now, I have made it available as a Zef graph with tag blog/arxiv, which contains the data exported from 5 different topic for the period Jan-May 2022. This represents ~14,000 articles with ~44,000 author names.

There are also other features on the blog/arxiv graph, including categories. Here, we focus on only the author and preprint entities, ET.Author and ET.Preprint and their connecting relations RT.AuthoredBy. Note that the subgraph of these entities and relations is bipartite, which means there are only two kinds of entities and the connections are only between entities of different kinds (no direct author-author or preprint-preprint connections)

The example at the top of this post is a tiny subgraph of blog/arxiv.

Analysing the data using NetworkX

The kinds of algorithms we are going to be using are available as functions in the various NetworkX submodules, e.g.

nx.bipartite.clustering
nx.connectivity.edge_connectivity
...

but these functions take a NetworkX graph as their first argument. To expose a Zef graph to NetworkX we use the ProxyGraph wrapper object:

from zef.experimental.networkx import ProxyGraph

g = Graph("blob/arxiv")

# A lazy view on a Zef graph
ng_view = ProxyGraph(now(g),
                ET.Author | ET.Preprint,
                RT.AuthoredBy,
                undirected=True)
                
# Convert to a native NetworkX format where the labels are ZefRefs
# Faster for smaller graphs when laziness is not required.
ng = ng_view.to_native()

Now we can pass either ng Β or ng_view to many of the standard NetworkX algorithms. For example:

import networkx as nx
comps = list(nx.connected_components(ng))

returns a list of sets of entities which are connected to one another. This is our starting point as name disambiguation can only create more disconnected components.

Simple measures

We can't ask NetworkX to directly determine likely candidates for names are are actually two different people. Instead we have to ask questions like:

  • "Minimum cut": Which node, when removed, would break the graph in two?
  • "Local clustering": Which authors have a large fraction of co-authors that don't know each other?
  • "Clusters": What groups of nodes are in the graph where two groups are "bridged" by a single author or few authors.

The largest connected component holds over 60% of the total nodes of the graph. This is too large a component to visualise and explore directly (likely it is so large because of ambiguous naming!), so instead we look at a smaller connected component of only 57 nodes:

comps = comps | sort[length] | reverse | collect
sg = ng.subgraph(comps[4])
cols = list(sg) | match[
        (ET.Author, always["green"]),
        (ET.Preprint, always["blue"])
    ] | collect
nx.draw(sg, node_color=cols)
Authors (green) and preprints (blue) 

Where the larger, green circles are articles and the smaller blue circles are authors. By eye, we can see that it is likely that these are all valid unique individuals. However, there are many papers on the edges of the graph which are only linked to the rest via a single author. Let's try and identify these.

Attempt 1: minimum cut. If we ask NetworkX for a minimum cut we find that it (correctly) identifies that taking out one paper will split the graph in two:

edge_nodes = nx.minimum_node_cut(sg)

def get_cols_hl(nodes, edge_nodes):
    return list(nodes) | map[match[
        (Is[contained_in[edge_nodes]], always["red"]),
        (ET.Author, always["green"]),
        (ET.Preprint, always["blue"])
    ]] | collect
    
nx.draw(sg, node_color=get_cols_hl(sg, edge_nodes), node_size=get_sizes(sg))
A possible minimum cut shown in red

But this is not what we want, and we'd prefer to only consider authors which would split the graph in two. Noting that we have a bipartite graph, we can do this by projecting onto the authors only:

authors = list(sg) | filter[is_a[ET.Author]] | collect
authors_proj = nx.bipartite.projected_graph(sg, authors)
edge_nodes = nx.minimum_node_cut(authors_proj)
A possible minimum cut on the author-projected graph

This is good, the author name found this way could very well be two different people. But this is still not the best result, as this is only one of many different ways to split the graph into two.

Attempt 2: local clustering. We can instead look for authors whose co-authors naturally fall into separate groups. This can be done by looking for triangles/squares in the graph. The bipartate clustering of NetworkX does something similar:

bc = nx.bipartite.clustering(sg)
(bc | items
    | sort[second]
    | map[apply[first,
                first | rae_type,
                second]]
    | collect)
[(Node(#5469948), ET.Preprint, 0.015625),
 (Node(#3796972), ET.Preprint, 0.015625),
...
 (Node(#1551908), ET.Preprint, 0.1111111111111111),
 (Node(#2710468), ET.Author, 0.46923076923076906),
...
 (Node(#1417185), ET.Author, 0.521505376344086),
 (Node(#2025910), ET.Author, 0.75),
...
 (Node(#4794050), ET.Author, 0.9642857142857143),
 (Node(#4794094), ET.Author, 0.9642857142857143)]

The output above shows that there are distinct breaks in the local clustering coefficient. If we focus on the authors, then there is a jump around 0.5. Using this as a guide we can quickly identify interesting authors:

edge_nodes = (bc | items
                 | filter[first | is_a[ET.Author]]
                 | filter[second | less_than[0.5]]
                 | map[first]
                 | collect)
Authors with low local clustering coefficient

This has achieved the goal of finding many more opportunities to split the graph in two. However, it also looks like we have overcompensated as not all of these will be duplicate names. Starting from a larger candidate set is good though, as it is always possible to reduce it with further criteria.

Attempt 3: clustering. To show off a different method again, we can use the clustering methods of scikit learn. Here is one simple example:

import sklearn.cluster

A = nx.adjacency_matrix(sg)
labels = sklearn.cluster.spectral_clustering(A, n_clusters=4)
nx.draw(sg, node_color=labels)
Clustered groups

This is looking great. The choice of 4 clusters is somewhat arbitrary on my part, although I could imagine choosing a few clusters like this and iteratively breaking up the network. At each iteration, I would identify any clusters which are connected by a single author and be able to select those authors as candidates for name disambiguation. These boundaries could be found via:

boundary_authors = (
  list(nx.edge_boundary(sg, cluster[0], cluster[1])) 
    | concat
    | filter[ET.Author]
    | collect)

Further options

There are so many other ideas to attack this kind of problem. We could use the authors identified above as candidates for a more rigorous check of their likelihood as duplicate names. For example, we could separate out clusters and use flow techniques to see if the flow between different clusters is constricted at one particular author. We could also "colour" the RT.AuthoredBy edges by the topic of the articles and see if this naturally separates the groups.

But this is something I'd rather leave to people whose expertise lies in this area!

A little aside about projected graphs

NetworkX provides a way to calculate a projected graph, and we used this above. Here's what those graphs look like (keeping the same node positions and colouring):

However, sometimes it might be necessary to use more information to work out a projection than the simple bipartite data that has been made available to NetworkX. For example, we might want to consider a projection onto preprints, where any edge drawn between two preprints means they share at least one author AND they don't share any categories between them. Here's how we might go about this using pure Zef:

@func
def share_field(pair, rt):
    return (pair[0] | Outs[rt]
            | map[contained_in[pair[1] | Outs[rt] | collect]]
            | any
            | collect)
              
share_author = share_field[RT.AuthoredBy]
share_category = share_field[RT.InCategory]

# Place everything on a second graph using merging
g2 = Graph()

# Merging over the preprint nodes (without connections)
pp = list(sg.nodes) | filter[is_a[ET.Preprint]] | collect
pp | g2 | run

# Merging over the edges between preprints as RT.Connection's.
edges = (pp | combinations[2]
            | filter[And[share_author][Not[share_category]]]
            | collect)
edges | map[apply[first, always[RT.Connection], second]] | g2 | run

# Taking a view of this reduced graph
ng2 = ProxyGraph(now(g2), undirected=True)
nx.draw(ng2)
Projected graph of preprints that share authors but don't share topics

This graph is still fully connected, but the structure of it is somewhat different from the simple bipartite projection onto preprints.