GraphNesting/02_related_bis.tex at clean · gyankos/GraphNesting · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
% !TeX root = 00_paper_entrypoint.tex
\section{Related Work}

\subsection{Nested Graph Representations}
\textbf{Statecharts} \cite{statecharts} represent one of the first applications of nested graphs for complex systems modelling. This choice allowed the representation of multiple abstraction levels at the same time: each node represents a  state or ``configuration'' of the system, and each edge represents a transaction between two different states on a given event. The state containment of multiple states and edges allows multiple nesting levels.  As a consequence,  there is no distinction between (simple) states and states containing other states. Given that this model was not designed for data representations, vertices and edges are labelled but cannot contain any property-value association.
This model allows both \textit{external edges} and \textit{internal edges}: an edge  $e$ will be called \textit{external} if its source (or target) is contained by the target (or source) but neither of them contains $e$; the edge will be called \textit{internal} when the containing vertex (either its source or target) also contains the edge. Besides to state representation purposes such as UML diagrams, this model has also been  used for both modelling the evolution of \textit{pathophysiological} states and to describe the clinical treatments to which the patient must undergo. This model also allowed to subdivide each treatment  into smaller and consequential procedures \cite{NestedGlaucoma}.

Statecharts were also adopted as a basis for the subsequent \textbf{hypernode} data model \cite{Poulovassilis1994}. Like statecharts, this model does not allow to fully represent a property graph, since the attribute-value association must be necessarily expressed as a relation between two different vertices but, on the other hand, hypernodes allow neither edge labelling nor external and internal edges. A first extension of the hypernode model towards data representation is represented by CoGITaNT \cite{GenestS98}, where any type of edge (thus including internal and external ones) are included and data is firstly contained inside a node. Nested graphs can be represented as XML \cite{graphml,GXL}.

Current literature uses two different approaches for extending
% Two different approaches have been used to extend
graph databases to support nesting operations:
some try to overcome graph data structure limitations by extending their query languages, while others try to extend the data structures used for both input and intermediate computations. Among the first type of approaches, \cite{Etcheverry2012} proposes the definition of a RDF  vocabulary over which the OLAP  cube can be defined. On top of this ``structured'' RDF graph, an algorithm generates the SPARQL query that will allow to perform either the roll-up or the drill-down operation. This implies that each possible computation over the data view has to be always recomputed on top of the raw data as in ROLAP systems, thus thwarting the benefits of updating the intermediate query result. On the other hand, the last type of approaches has been recently widely investigated  and seems to be more promising with regard to optimization techniques. In these approaches \cite{Tian20085,ChenYZHY08,Qu2011}, graph data structures are associated with external graph indices and, thus, allow to connect one graph to a broader one with respect to the roll-up query. As a consequence, these solutions do not allow to freely expand any aggregate components at the same time but can only backtrack the aggregation to a previous known state. %Some further details are going to be provided on Chapter \vref{cha:nesting}, where such operator will be implemented on a specific algorithm.
%As it will be showed in Chapter \ref{cha:graphsdef}, in order to meet such goals the nesting indices are going to be directly embedded within the definition of the nested (graph) data model, thus allowing to extend all the aforementioned approaches.


\subsection{Databases and Query Languages}\label{subsec:pathsumm}
As previously mentioned, some recent  query languages support graph nesting semantics by exploiting the underlying data model's features. To express the query presented in our running example, these query languages must support as valid values  either vertices' and edges' id collections or nested representations.
For these reasons, we firstly select PostgreSQL which, by extending the SQL-3 syntax and by allowing JSON data, provides an \texttt{array\_agg} aggregation function. The latter collects (i.e., groups) the result-set into arrays.  We  represent  graphs by  storing edge triples as follows:\begin{center}
 \texttt{Edge}(\textit{edgeId},\;\textit{sourceId},\;\textit{edgeLabel},\;\textit{targetId})
\end{center} In our running example, nested vertices  are obtained by
grouping edges by \textit{sourceId} and collecting all target's id results via \texttt{array\_agg}. Similarly, our nested edges are obtained by joining consecutive \texttt{Edge}s and then grouping  by  distinct \textit{sourceId} and \textit{targetId}  vertices limiting the two hop path; the list of the \textsc{Paper}s' id is collected via \texttt{array\_agg}. The overall graph nesting cannot be created in one single SQL query, because we cannot distinctively group the same dataset in different ways. Instead, we must perform two distinct aggregations (see Listing \ref{SQLNesting} in Appendix).


Despite the fact that SPARQL  may represent the graph nesting query as a single statement, a \texttt{UNION} clause implies a separate visit for the two graph patterns. The first pattern presented in Listing \ref{SPARQLNesting} (see Appendix) allows to traverse those  patterns matching the coauthorship statement in Figure \ref{fig:pathPat}, so that they can be nested within the created \textsc{CoAuthorship} edge. The second part returns the remaining \textsc{Paper} associations that have been authored by one single \textsc{Author}. In the first case, the edge nesting is performed via the association of different \texttt{<http://cnt.io/nesting>} properties departing from one single  \textsc{coAuthorship} edge (\texttt{?newedge}).

ArangoDB is a document-oriented NoSQL database, which query language AQL allows the access and the creation of nested members. An example of how such graph nesting query can be carried out in AQL is presented in Listing \ref{AQLQueryNesting}: in this scenario we assume that we've previously loaded our graph data with the default ArangoDB format, where  vertices are indexed by id only and edges are additionally indexed by source and target vertex ids. Even though AQL returns JSON documents instead of relational tables, we can state that its resulting query plan is similar to PostgreSQL's query plan.
%
%Let us consider the following query that will be express in our different languages:
%``\textit{Total number of orders handled by each employee. Only list employees that handled more than 100 orders}''
%
%At this point we immediately notice that NautiLOD and G could not express the summarization since those languages
%are graph traversal languages for pattern matching: consequently they could only select paths and subgraphs but
%they do not aggregate (summarize) nodes. In this case a bag of both employees and the number of the orders
%is returned. Since the result is a bag of values, we cannot obtain a graph as an output, and hence we cannot establish some new edges through the employee and the aggregated value of the sales.
%
%\begin{lstlisting}[caption={Summarization query in Gremlin},language=gremlin,frameround=fttt,frame=trBL]
%graph = TinkerGraph.open()
%graph.io(IoCore.gryo()).readGraph('/path/to/graph')
%g = graph.traversal()
%
%g.V().hasLabel("Employee").match(
%      __.as("emp").in("SalesEmployee").hasLabel("Sales").count()
%                                      .as("ordersByEmployee"),
%      __.as("ordersByEmployee").is(gt(100))
%).select("emp", "ordersByEmployee")
%\end{lstlisting}
%\medskip
%
%The SPARQL query returns a table with two attributes, where the first is the Employee ID and the
%second element is the number of its handled orders. In this case the output is expressed as a table.
%
%\begin{lstlisting}[caption={Summarization query in SPARQL: Table},language=sparql,frameround=fttt,frame=trBL]
%PREFIX ex:<http://example.it/Relations#>
%
%SELECT	 ?emp, (COUNT (?sales) AS ?ordersByEmployee)
%WHERE    {
%          ?sales a                 ex:Sales;
%                 ex:SalesEmployee  ?emp.
%          ?emp   a                 ex:Employee.
%         }
%GROUP BY ?emp
%HAVING   COUNT(?orderNo) > 100
%\end{lstlisting}
%
%With the \texttt{\textbf{CONSTRUCT}} clause we could return the previous result inside an RDF Graph.
%\begin{lstlisting}[caption={Summarization query in SPARQL: Graph},language=sparql,frameround=fttt,frame=trBL]
%PREFIX ex:<http://example.it/Relations#>
%
%CONSTRUCT { ?ordersByEmployee ex:SalesEmployee ?emp. }
%WHERE {{
%	SELECT	 ?emp, (COUNT (?sales) AS ?ordersByEmployee)
%	WHERE    {
%	          ?sales a                 ex:Sales;
%	                 ex:SalesEmployee  ?emp.
%	          ?emp   a                 ex:Employee.
%	}
%	GROUP BY ?emp
%	HAVING   COUNT(?orderNo) > 100
%}}
%\end{lstlisting}
%\medskip
%
%In Cypher we could formulate a similar query with a tabular result as follows:
%\begin{lstlisting}[caption={Summarization query in Cypher: Table},language=cypher,frameround=fttt,frame=trBL]
%MATCH  (sales:Sales)-[:SalesEmployee]->(empl:Employee)
%WITH   empl AS emp, COUNT(sales) AS ordersByEmployee
%WHERE  ordersByEmployee > 100
%RETURN empl, ordersByEmployee
%\end{lstlisting}
%
%With this language we could even return a new graph, where
%the whole information of the employee is returned and where the films are aggregated.
%\begin{lstlisting}[caption={Summarization query in Cypher: Graph},language=cypher,frameround=fttt,frame=trBL]
%MATCH  (sales:Sales)-[:SalesEmployee]->(empl:Employee)
%WITH   empl AS emp, COUNT(sales) AS ordersByEmployee
%WHERE  ordersByEmployee > 100
%CREATE p=
%(:Sales {count: ordersByEmployee})-[:SalesEmployee]->(empl:Employee)
%RETURN p
%\end{lstlisting}
%\medskip
%
%Let us now focus on the BiQL language: firstly we cannot create new vertices that aggregate the results without
%updating the original database because the \texttt{\textbf{CREATE}} semantic has this precise meaning, secondly the
%\texttt{\textbf{CREATE}} clause does not allow to create multiple objects within the same query: this means that we
%cannot create new edges while creating new vertices. By the way we could return all the employees and store the
%number of the sales inside each node.
%\begin{lstlisting}[caption={Summarization query in Cypher: Graph},language=biql,frameround=fttt,frame=trBL]
%SELECT <empl>{empl.*, ordersByEmployee: count(salesEdge)}
%FROM   Employee empl <- SalesEmployee salesEdge
%WHERE  count(salesEdge) > 100
%\end{lstlisting}
%

Last, Listing \ref{Neo4JQuery} provides an example of Cypher: even if Neo4J's property graph model does not directly nest graphs inside vertices or edges, we can associate \textit{member} ids to each of them. This solution can be achieved by  first matching the vertex summarization pattern of Figure \ref{fig:vertexPat} and then performing an \textsc{Author} group by (\texttt{with}). Afterwards, we nest the set of authored \textsc{Paper}s via \texttt{collect}. Last, we match the path summarization pattern presented in Figure \ref{fig:pathPat} and group it by source and destination \textsc{Author}, then we create the \textsc{coAuthorship} edge containing the co-authored \textsc{Paper}'s id. As it will be observed within the benchmarks (see Section \ref{sec:nestexpeval}), the solution of not separating the elements' ids from their data quickly leads to an intractable solution.

In all the former query languages, the vertices undergoing  \texttt{GROUP BY}-s are either vertex or edge grouping references.