Introduction
A self-join is a relational join operation which joins a table to itself, typically based on equivalence predicates. For example, given the table \(\mathsf{knows}(\mathsf{person1\_id}, \mathsf{person2\_id})\), representing a social graph where people know each other, and aliases of this table \(\mathsf{knows1}, \mathsf{knows2}, \ldots\), take the following query:
SELECT knows1.person1_id AS pA, knows1.person2_id AS pB, knows2.person2_id AS pC
FROM knows knows1
JOIN knows knows2
ON knows2.person1_id = knows1.person2_id;
The query specifies a self-join on the \(\mathsf{knows}\) table, \(\mathsf{knows1} \bowtie \mathsf{knows2}\), and produces two-hop paths in the social graph.
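To make the result concrete, here is a minimal example with hypothetical sample data (the IDs are illustrative only):
CREATE TABLE knows(person1_id BIGINT, person2_id BIGINT);
INSERT INTO knows VALUES
(1, 2), -- person 1 knows person 2
(2, 3), -- person 2 knows person 3
(2, 4); -- person 2 knows person 4
On this data, the query returns the two-hop paths \((1, 2, 3)\) and \((1, 2, 4)\) as rows \((\mathsf{pA}, \mathsf{pB}, \mathsf{pC})\).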
In database literature, self-joins often have a “bad reputation”:
- Graph Processing: A Panoramic View and Some Open Problems: The direct relational mapping of RDF data is characterized as “Easy to implement but [having] too many self-joins” (p51)
- Tables are Not Your Friends: Graph Databases: “The relational database is completely capable of modeling […] networks through the use of foreign keys and self-joins. Unfortunately, the RDBMS generally hits performance issues when working with very large graphs […]” (p1)
- Avoiding SQL anti-patterns (Google BigQuery documentation): “Best practice: Avoid self-joins. […] Typically, self-joins are used to compute row-dependent relationships. The result of using a self-join is that it potentially squares the number of output rows. This increase in output data can cause poor performance.”
- Building an Efficient RDF Store Over a Relational Database: “Self-joins are expensive and therefore we use the following strategy to avoid them […]” (p9)
- LIquid: The soul of a new graph database, Part 2: “For a static query planner based on table-level statistics, self joins are intensely problematic.” and “If everything is stored as edges, we’re going to be doing a lot of joining, specifically self-joins. Every query with small intermediate results will almost certainly devolve into random access.”
The general sentiment, then, is that self-joins are expensive operations that are difficult to optimize. Let’s dive into why this may be the case.
Problems with self-joins
P1. Extra complexity introduced by storage
The RDF data model represents data as \((\mathsf{subject}, \mathsf{predicate}, \mathsf{object})\) triples. A popular approach is to store these triples in a single table, e.g.:
CREATE TABLE triples(subject BIGINT, predicate VARCHAR, object VARCHAR);
INSERT INTO triples VALUES
(1, 'name', 'Joe'),
(1, 'birthday', '1994-08-12'),
(1, 'location', 'Melbourne')
;
To query three attributes of the person with ID 1, we need to perform two self-joins:
SELECT
1 AS id,
triples1.object AS name,
triples2.object AS birthday,
triples3.object AS location
FROM triples triples1
JOIN triples triples2
ON triples2.subject = 1
AND triples2.predicate = 'birthday'
JOIN triples triples3
ON triples3.subject = 1
AND triples3.predicate = 'location'
WHERE triples1.subject = 1
AND triples1.predicate = 'name';
┌────┬──────┬────────────┬───────────┐
│ id │ name │ birthday │ location │
├────┼──────┼────────────┼───────────┤
│ 1 │ Joe │ 1994-08-12 │ Melbourne │
└────┴──────┴────────────┴───────────┘
While RDF’s representation is flexible, it leads to queries where fetching each additional attribute requires a self-join, introducing a random data access. Moreover, the optimizer has less information on the type (string, int, etc.) and distribution of each attribute, so it often produces suboptimal query plans.
In this context, self-joins are bad because they introduce unnecessary complexity that could be avoided by using a better schema:
CREATE TABLE person(id BIGINT, name VARCHAR, birthday VARCHAR, location VARCHAR);
INSERT INTO person VALUES(1, 'Joe', '1994-08-12', 'Melbourne');
SELECT id, name, birthday, location
FROM person;
This produces the same result without any joins by simply running a table scan.
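As a further refinement (a sketch; the DATE type is an assumption beyond the schema shown above), each attribute can be given a precise type, restoring exactly the information that the triple table loses:
CREATE TABLE person(id BIGINT, name VARCHAR, birthday DATE, location VARCHAR);
-- With a typed birthday column, the optimizer can maintain date statistics,
-- and a predicate such as birthday >= DATE '1990-01-01' is evaluated with
-- date semantics instead of string comparison.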
P2. Cyclic queries
Self-joins can be used to express “cyclic queries”, i.e. queries which define a subgraph with a cycle in it. A common example is the triangle query \(\mathsf{knows1} \bowtie \mathsf{knows2} \bowtie \mathsf{knows3}\) (with the join predicates specifying a cycle):
SELECT knows1.person1_id AS pA, knows1.person2_id AS pB, knows2.person2_id AS pC
FROM knows knows1
JOIN knows knows2
ON knows2.person1_id = knows1.person2_id
JOIN knows knows3
ON knows3.person1_id = knows2.person2_id
AND knows3.person2_id = knows1.person1_id;
In 2012, it was proven that for a graph with \(m\) edges, any query plan using only binary join operators requires \(\Omega(m^2)\) time in the worst case. While the proof uses an adversarial example graph, the excessive cost stems from skew in the graph’s degree distribution, which many real-world graphs exhibit.
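To see where the quadratic blow-up comes from, consider a minimal sketch of such a skewed graph (the sizes and the range table function, available e.g. in DuckDB, are assumptions for illustration):
DELETE FROM knows; -- start from an empty table for this experiment
-- Build a "star" graph: person 0 knows persons 1..10000 and vice versa,
-- i.e. m = 20000 edges and no triangles at all.
INSERT INTO knows
SELECT 0, i FROM range(1, 10001) t(i)
UNION ALL
SELECT i, 0 FROM range(1, 10001) t(i);
On this graph, any plan built from binary joins must first join two aliases of \(\mathsf{knows}\), producing all \(10^8\) intermediate two-hop paths of the form \((i, 0, j)\), even though the final result is empty.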
Interestingly, using a multi-way join operator, e.g. \(\bowtie(\mathsf{knows1}, \mathsf{knows2}, \mathsf{knows3})\), and a worst-case optimal join algorithm to evaluate it, a worst-case complexity of \(O(m^\frac{3}{2})\) can be guaranteed, making the problem tractable on large, billion-edge graphs.
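To put these bounds in perspective: for a graph with \(m = 10^9\) edges, \(m^2 = 10^{18}\), while \(m^{3/2} \approx 3.2 \times 10^{13}\), a difference of more than four orders of magnitude.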
In this context, self-joins are perceived as “bad” because they often occur in cyclic queries, and such queries cannot be evaluated efficiently by currently available industry RDBMSs. However, self-joins are necessary to specify such queries, and evaluating them efficiently is possible using these recently invented algorithms.
P3. Expensive hash joins
To compute two-hop neighbourhoods in a social network as shown in the introduction, we need to evaluate the expression \(\mathsf{knows1} \bowtie \mathsf{knows2}\). RDBMSs typically use hash joins for this. During a hash join, the system first selects the smaller of its two input tables and builds a hash table on it. Then, it scans the other input table and finds the matching tuples using the hash table.
For a self-join where no predicates restrict either input side, the inputs are of equal size, so there is no smaller side to pick: a large hash table has to be built either way. Moreover, the resulting table is likely to contain many rows, potentially quadratic in the size of the input.
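One way to gauge this cost before running the full query is to compute only the join’s output cardinality (a sketch; the count still evaluates the join, but avoids materializing the wide result):
SELECT count(*)
FROM knows knows1
JOIN knows knows2
ON knows2.person1_id = knows1.person2_id;
-- On the star graph from P2, this reports roughly 10^8 result rows
-- produced from only 20000 input edges.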
Note that the output of a large self-join can be represented more efficiently using factorized representations. However, this technique is currently only available in research prototypes (e.g. AvantGraph and GraphflowDB).
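To give an intuition for the idea, here is a sketch of a factorized two-hop result in plain SQL (using DuckDB’s list aggregate; array_agg in PostgreSQL), not how the cited systems actually implement it:
SELECT in_n.pB, in_n.pAs, out_n.pCs
FROM (
  SELECT person2_id AS pB, list(person1_id) AS pAs
  FROM knows GROUP BY person2_id
) in_n
JOIN (
  SELECT person1_id AS pB, list(person2_id) AS pCs
  FROM knows GROUP BY person1_id
) out_n
ON in_n.pB = out_n.pB;
-- Each row encodes all two-hop paths through pB as the product pAs × pCs,
-- so the representation stays linear in the number of edges even when the
-- flat output is quadratic.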
Summary
Self-joins are not inherently bad, but they often arise from or incur accidental complexity. We have discussed two such problem categories: P1. using a triple-based representation and P2. evaluating cyclic queries with binary joins. These can be avoided by using a better storage layout and more advanced join algorithms, respectively. In our third category, P3, self-joins deal with the essential complexity of a combinatorially expensive problem. Still, there is room for improvement here: the output can be stored more efficiently with factorized representations.