GraphNesting/06_thosp_short.tex at clean · gyankos/GraphNesting · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
% !TeX root = 00_paper_entrypoint.tex


\begin{figure*}[!ht]

\begin{tabular}{c}
	{$AdjFile[o]\equiv\Braket{\textbf{\textsc{Header}},\nu(o),\epsilon(o),\Braket{\ell(o),\omega(o)},\texttt{outLen},\textit{\textsc{OutgoingEdges[]}},\texttt{inLen},\textit{\textsc{IngoingEdges[]}}}$} \\
	$\textit{\textsc{OutgoingEdges[}i\textsc{]}}\equiv\textit{\textsc{IngoingEdges[}i\textsc{]}}\equiv \textsc{\textbf{EdgeEntry}[}i\textsc{]}$ \\
	\linebreak \\
	{$\textbf{\textsc{Header}}:=\Braket{\texttt{length},\texttt{id},\texttt{hash},\texttt{ellOffset},\texttt{epsilonOffset},\texttt{contentOffset},\texttt{outOffset},\texttt{inOffset}}$}\\ $\textsc{\textbf{EdgeEntry}[}i\textsc{]}:=\Braket{\textsc{id[}i\textsc{]},\textsc{hash[}i\textsc{]},\textsc{adjVertexId[}i\textsc{]},\textsc{adjVertexHash[}i\textsc{]}}$\\
\end{tabular}
	\caption{Serialized data structure representing an extended adjacency list for one nested vertex $o$. The header contains some basic information (such the representation size of $o$, its id and associated hash) and the offsets to the remaining fields. $\nu$ and $\epsilon$ are empty when the serialized graph represents a basic property graph as the one in Figure \ref{fig:inputbibex2}.}\label{nestedGraphVertex}
\end{figure*}
\section{Graph Nesting}\label{sec:nestingdef}
We now define the graph nesting operation, which uses the outcome of the following partial and overlapping graph clustering operator (e.g., graph pattern matching) to provide structural aggregation:
%The graph nesting operator uses a classifier function grouping all the vertices and edges that shall appear as a member of a cluster $C$.

\begin{definition}[Nested Graph Classifier, $g_\kappa$]
	Given a set of cluster labels $\;\mathcal{C}$, a \textbf{nested graph classifier} function $g_\kappa$ maps a nested graph $G_o$ into a nested  graph collection $\{G_C\}_{C\in\mathcal{C},G_C\neq \emptyset}$ of subgraphs of $G_o$. Such function uses a classifier function $\kappa\colon \VS\cup \ES\to \wp(\mathcal{C})$ mapping each vertex or edge in either no graph or at least one non-empty subgraph. Each nested graph $G_C$ is a pair $G_C=\Braket{\VS_C,\ES_C}$
	where $\VS_C$ (and $\ES_C$) is the set of all the vertices $v$ (and edges $e$) in $G_o$ having $C\in \kappa(v)$ (and $C\in \kappa(e)$). Therefore, the nested graph classifier is defined as follows:
	\[g_\kappa(G_o)=\Set{\Braket{\VS_C,\ES_C}|C\in\mathcal{C},(\VS_C\neq\emptyset\vee\ES_C=\emptyset)}\]
\end{definition}

When $\kappa$ is a pattern matching graph as in Figure  \ref{fig:patterns} as possible $\kappa$,   $\kappa\xrightarrow{f_C} G_C$ is the function $f_C$ associating  each vertex (and edge) in $\kappa$ to possibly more than one vertex (and one edge) in a subgraph $G_C\in g_\kappa(G_o)$. In order to represent the latter subgraphs  as either vertices and edges, we may use
the following \textsc{User-Defined Functions}:
\begin{definition}[User-Defined Functions]
	An \textbf{object user defined function} $\mu_\Omega$ maps each subgraph $G_C\in g_\kappa(G_o)$ into a pair $\mu_\Omega(G_C)=(L,t)$, where $L\in\wp(\Sigma^*)$ is a set of labels and $t$ is a relational tuple.

	An \textbf{edge user defined function} $\mu_E$ maps each subgraph $G_C\in g_\kappa(G_o)$ into a pair of identifiers $\mu_E(G_C)=(s,t)$ where $s,t\in\mathbb{N}$.
\end{definition}

\begin{ex}
Within our use case scenario, $\mu_\Omega$  must associate  the authors' informations to each nested vertex resulting from $g_{V}(G_{o})$ , and create nested edges with \textsc{coAuthorship} label and no associated tuple:
\[\mu_\Omega(G_C)=\begin{cases}
([\mstr{coAuthorship}],\;\emptyset) & G_C \in g_E(G_{o})\\
(\ell(f_C(\gamma_V))),\;\omega(f_C(\gamma_V))) & G_C \in g_V(G_{o})\\
\end{cases}\]
\end{ex}

While $\mu_\Omega$ may be used for transforming subgraphs to both vertices and edges, $\mu_E$ is only used to map subgraphs to edges.
In order to complete such transformation, we have to map each graph in $g_\kappa(G)$ into a new id $(\textbf{c},\textbf{i})\notin \VS\cup \ES$, for which an indexing function $\iota_G$ over each $G_C$ has to be defined within our specific task. As we will see in the next section, our scenario provides some constraints on both patterns; this allows the definition of an indexing function  uniquely associating each matched subgraph  $G_C$  to the grouping references' ids.
The previous functions are involved in the definition of our general graph nesting operator, generalizing current literatures' graph summarization and roll-up operators:


\begin{definition}[Graph Nesting]
Given a nested graph $G_{(c,i)}$ within a nested graph database $G$, an object user defined function $\mu_\Omega$, an edge user defined function $\mu_E$ and an indexing function $\iota_G$, the graph nesting operator $\eta_{g_V,g_E,\mu_\Omega,\mu_E,\iota_G}^{\textbf{keep}}$ converts each subgraph in $G_C\in g_V(G_{(c,i)})$ (and $G_C\in g_E(G_{(c,i)})$) into a nested vertex (and nested edge) $\iota_G(G_C)$ and adds them in a newly-created nested vertex; vertices and edges in $G_{(c,i)}$ appearing neither in a nested vertex nor in a nested edge may be also returned if $\textbf{keep}$ is set to \texttt{true}. This operator returns the following nested graph:
\[\begin{split}
\eta&{}_{g_V,g_E,\mu_\Omega,\mu_E,\iota_G}^{\textbf{keep}}(G_{(c,i)})=G_{(\overline{c},dt(i))}=\\
&=\Big\langle \{v\in \nu(c,i) | V(v)=\emptyset\wedge\textbf{keep} \}\cup \iota_G(g_V(G_{(c,i)})),\\
&\qquad \{e\in \epsilon(c,i) | E(e)=\emptyset\wedge\textbf{keep} \}\cup \iota_G(g_E(G_{(c,i)}))\Big\rangle\\
\end{split}\]
where $\overline{c}=\max\{c|(c,i)\in\mathcal{V}\cup\mathcal{E}\}+1$. As a side effect of the graph nesting operation, the nesting graph database is updated using the nested graph classifier and user defined functions as follows:
	\begin{align*}
	\Big\langle&\VS\cup \iota_G(g_V(G_{(c,i)}))\cup\{(\overline{c},i)\},\quad \ES\cup \iota_G(g_E(G_{(c,i)})),\\
	& \lambda\oplus \bigoplus_{G_C\in g_E(G_{(c,i)})}\iota_G(G_C)\mapsto \mu_E(G_C),\\
	& \ell\oplus\bigoplus_{G_C\in g_E(G_{(c,i)})\cup g_V(G_{(c,i)})}\iota_G(G_C)\mapsto\texttt{fst}\;\mu_\Omega(G_C),\\
	& \omega\oplus\bigoplus_{G_C\in g_E(G_{(c,i)})\cup g_V(G_{(c,i)})}\iota_G(G_C)\mapsto\texttt{snd}\;\mu_\Omega(G_C),\\
	& \nu\oplus \bigoplus_{G_C\in g_E(G_{(c,i)})\cup g_V(G_{(c,i)})}\iota_G(G_C)\mapsto \mathcal{V}_C\\
	& \;\; \oplus (\overline{c},dt(i))\mapsto \{v\in \nu(c,i) | V(v)=\emptyset\wedge\textbf{keep} \}\cup \iota_G(g_V(G_{(c,i)})), \\
	& \epsilon\oplus \bigoplus_{G_C\in g_E(G_{(c,i)})\cup g_V(G_{(c,i)})}\iota_G(G_C)\mapsto \mathcal{E}_C\\
	& \;\; \oplus (\overline{c},dt(i))\mapsto\{e\in \epsilon(c,i) | E(e)=\emptyset\wedge\textbf{keep} \}\cup \iota_G(g_E(G_{(c,i)}))\Big\rangle\\
	\end{align*}
	where $(f\oplus g)(x)$ returns $g(x)$ if $x\in\textup{dom}(g)$ and $f(x)$ otherwise, and both $f$ and $g$ are finite domain functions. $a\mapsto b$ denotes a finite function, which domain contains only $a$.
\end{definition}

The following example describes the outcome of the graph nesting process.

\begin{ex}[label=exNest]
Figure \ref{fig:outputnested} provides the result of $\eta$ when the non-traversed vertices and edges are not preserved ($\textbf{keep}=\texttt{false}$) and where $V$ and $E$ are the ones represented in Figure \ref{fig:bibex2}. As showed by the former definition, the nesting operation updates the nested graph database by creating new nested vertices ($(1,0),(1,1),(1,2)$) and nested edges ($(1,3),(1,5),(1,7),(1,8)$). Such nested components are contained within the returned nested graph $G_{(1,11)}$, which is represented as a nested vertex with the following members:
\[\nu(1,11)=\{(1,0),(1,1),(1,2)\}\quad \epsilon(1,11)=\{(1,3),(1,5),(1,7),(1,8)\}\]
The nested graph database updated as a side effect of the graph nesting may be represented as follows:
\[\begin{split}
G'=\big\langle &\{(0,0),(0,1),\dots,(0,5),(0,11),(1,0),(1,1),(1,2),(1,11)\},\\
	  & \{(0,6),\dots,(0,10),(1,3),(1,5),(1,7),(1,8)\}, \\
	  & \lambda',\ell',\omega',\nu',\epsilon'\big\rangle\\
\end{split}\]
Let us now focus on the nested vertices and edges of $G_{(1,11)}$. As requested by the UDF functions, each resulting nested \textsc{Author}(2) preserves the original  vertices' tuple information, and its vertex members correspond to the \textsc{Paper}s authored by the corresponding \textsc{Author}(1). For easing the nested graph representation, we assume that each \textsc{Author}(2) has an associated  id $(1,i)$, which derives from a simple vertex with id $(0,i)$ in $G_{(0,11)}$. Therefore, vertex $(1,0)$ is represented as follows:
\[\ell'(1,0)=[\mstr{Author}]\quad\nu'(1,0)=\{(0,3)\}\quad \epsilon'(1,0)=\emptyset\] \[\omega'(1,0)=\{\texttt{\textbf{name}}\colon\texttt{Abigail},\texttt{\textbf{surname}}\colon\texttt{Conner}\}\]

Last, each resulting nested edge \textsc{coAuthorship} has a \mstr{coAuthorship} label, it has no tuple information and its vertex members correspond to the \textsc{Paper}s coauthored by source and target \textsc{Paper}.
For easing the nested graph representation, we assume that each \textsc{coAuthorship} edge $(1,a)\to (1,a')$ has an associated id $(1,\sum_{k=0}^{a+a'}k\;+\;a')$, which derives from the grouping references. Therefore, edge $0\to 2$ in Figure \ref{fig:outputnested} is represented as follows:
\[\ell'(1,5)=[\mstr{coAuthorship}]\;\nu'(1,5)=\{(0,3)\}\;\epsilon'(1,0)=\emptyset\; \omega'(1,0)=\{\}\]
\end{ex}


\section{Two HOp Separated Patterns Algorithm} \label{sec:THOSPA}
Walking on the footsteps of relational databases, where algorithms were provided for specific instances of relational operators, THoSP provides an implementation for a specific graph nesting task. We now propose a sequential algorithm requiring a preliminary phase, where the operand is loaded into secondary memory using the \textbf{input data  representation}, and where no ancillary indexing data structures are serialized. Such representation is presented in Figure \ref{nestedGraphVertex}: it is an extension of the graph adjacency lists serialization previously proposed in \cite{bergamimm17} where each vertex $o$ has an associated \textbf{\textsc{Header}} containing its id ($o$), its associated hash and the offset pointing to other serialzied fields, such as the labelset and eventually its property-value representation ($\Braket{\ell(o),\omega(o)}$). Last, for each edge we store its id and hash value, as well as the hash and the id of the adjacent vertex. Hash values are used within the proposed THoSP algorithm to store the correspondences with the graph patterns in Figure \ref{fig:patterns}; therefore, each $\ell(o)$ is associated to a distinct hash value  $h(o)$. We also suppose that the input graph data to be serialized does not represent an exact adjacency list: for this reason, the graph is initially created in primary memory without the offset information and afterwards serialized into secondary memory.

In order to solve our specific graph nesting problem as presented in Example \vref{ex2}, we have to formally determine the $\eta$ parameters representing THoSP.
We focus on vertex (and edge) summarization patterns which grouping references associate unique vertices to distinct matching subgraphs; they require that:
\[\forall G_C\in g_V(G_o).\neg\exists G_d\in g_V(G_o).\; G_c\neq G_d\wedge f_C(\gamma_V)=f_D(\gamma_V)\]
\[\begin{split}
\forall G_C\in g_E(G_o).\neg\exists &G_d\in g_E(G_o).\; G_c\neq G_d\\
 &\wedge f_C(\gamma_E^{src})=f_D(\gamma_E^{src})\wedge f_C(\gamma_E^{dst})=f_D(\gamma_E^{dst})\\
\end{split}\]
This requirement leads to a one-to-one mapping between subgraphs $G_C$ and vertices matched by vertex (or edge) grouping references, that can be expressed by the following indexing function:
	\[\iota_G(G_C)=\begin{cases}
(\overline{c},dt(\texttt{snd}f_C(\gamma_V))) & G_C \in g_V(G_{o})\\
(\overline{c},dt\left(\texttt{snd}f_C(\gamma_E^{src}),\;\texttt{snd}f_C(\gamma_E^{dst})\right)) & G_C \in g_E(G_{o})\\
\end{cases}\]
$dt$ is an arbitrary bijection associating a  $n$-tuple in $\mathbb{N}^n$ to one single integer $\mathbb{N}$ (see Appendix B).
This assumption permits a deterministic $\mu_E$ function, associating to each newly created nested edge from $G_C$  two nested vertices having $f_C(\gamma_E^{src})$ and $f_C(\gamma_E^{dst})$ as vertex grouping references:
	\[\mu_E(G_C)=\Braket{(\overline{c},dt(\texttt{snd}f_C(\gamma_E^{src}))),\;(\overline{c},dt(\texttt{snd}f_C(\gamma_E^{dst})))}\]
Please note that the former function provides such association without additional join costs.

\begin{algorithm}[!t]
	\caption{Two HOp Separated Patterns Algorithm (THoSP)}\label{alg:THoSPAlgorithm}
	\begin{adjustbox}{max width=\textwidth}
		\begin{minipage}{1\linewidth}
			\algrenewcommand\algorithmicindent{1em}
			\begin{algorithmic}[1]
				\Procedure{doNest}{$Index,\textsc{pattern},f, \protect\overrightarrow{memb} =\{m_1,\dots,m_n\}$}
				\For{\textbf{each} $m_i\in \protect\overrightarrow{memb}$ \textbf{s.t.} \textsc{pattern}($\protect\overrightarrow{memb}$).doSerialize($m_i$)}\label{patternrequire}
				\State{$Index$.write($\Braket{f,m_i}$)}
				\EndFor
				\EndProcedure
				\State

				\Procedure{$\eta^{\texttt{false}}_{g_{V}\;,g_{E},\dots}$}{$\nested$}
				\State \textsc{File} $AdjFile$ = \textsc{Open\_MemoryMap}($\nested$);\label{openMMAP}\Comment{Serialized operand}
				\State \textsc{File} $Nesting$ = \textsc{Open}(\textbf{new}); \Comment{$\nu \cup\epsilon$, member information}
				\State \textsc{Adjacency} $toSerialize$ = \par \textbf{new} \textsc{Map<Vertex,<Edge,Vertex>>}(); \Comment{Nested graph, adj. list}
				\State {$\alpha:=V\cap {E}\backslash(\gamma_V\cup\gamma_E^{src}\cup\gamma_E^{dst})$;}\Comment{Shared pattern} \label{restriction}
				\For{\textbf{each vertex} $v'$ in $AdjFile$}\Comment{$v':=(c,v)$}\label{firstJoin}
				\If{$v'\vDash\alpha$}\label{vdashalpha}
				\For {\textbf{each} $ (u',e,v')\vDash V$}\Comment{$u':=(c',u)$} \label{substantiallyIs}
				\State{$\overline{u}:=(\overline{c},dt(u))$} \Comment{$\iota_G$, nested vertex}
				\State{\textsc{doNest}($Nesting$, V, $\overline{u},\{u',e,v'\}$)}
				\For {\textbf{each} $ (w',e,v')\vDash V$}\Comment{$w':=(c'',w)$} \label{substantiallyIs2}
				\If{$ (u',e,v',e',w')\vDash E$}\label{thereIsEdge}
				\State{$\overline{w}:=(\overline{c},dt(w))$}  \Comment{$\iota_G$, nested vertex}
				\State{$\overline{e}:=(\overline{c},dt(u,w))$} \Comment{$\iota_G$, nested edge}
				\State{\textsc{doNest}($Nesting$, E, $\overline{e},\{u',e,v',e',w'\}$)}
				\State{$toSerialize$.put($\overline{u}$,$\Braket{\overline{e},\overline{w}}$)}
				\EndIf
				\EndFor
				\EndFor
				\EndIf
				\EndFor
				\State $AdjFile$.serialize($toSerialize$);
				\State \Return{($AdjFile$,$Nesting$)}\label{serialize}\Comment{Nested graph}

				\EndProcedure
			\end{algorithmic}
		\end{minipage}
	\vspace{-2em}
	\end{adjustbox}
\end{algorithm}

Algorithm \ref{alg:THoSPAlgorithm} provides the desired interpretation for the two pattern matching graphs returning the desired nested graph.
%provides the desired implementation of the THoSP algorithm using the outcome of the previous preprocessing. We first restrict $\alpha$ to one single vertex and two edges
 After opening the previously-loaded graph operand through memory mapping (Line \ref{openMMAP}), we must first identify a sub-pattern $\alpha$ (Line \ref{restriction}) that is going to be visited only once within the graph (Line \ref{vdashalpha}), after which either the vertex or the path summarization pattern can be visited in their entirety. We also perform some restrictions over these patterns enhancing such optimizations: for each vertex $v'$ matched by $\alpha$ (Line \ref{vdashalpha}) we know that we must (possibly) visit all the edges going from $v'$ towards the vertices $\gamma_E^{src}$ and $\gamma_E^{dst}$.  Therefore, having an edge as a constraint in $\alpha$ linking $v$ towards $\gamma_E^{src}$ or $\gamma_E^{dst}$ both in $E$ and $V$  reduces the graph visiting time to the actual edges traversed from $v'$ meeting the grouping references (Line \ref{thereIsEdge}). Therefore, we know when we finish  our patterns' instantiation after exhaustively matching all the elements within the pattern.
As a consequence, a ``path join'' is performed between the two nested patterns (Line \ref{firstJoin} with \ref{substantiallyIs2}): this is evident from the two vertex nested for loops appearing in the algorithm.

\input{99_results_table.tex}

Our physical data model differentiates the \textit{input data  represnetation} from the \textbf{query result} (Line \ref{serialize}).
We suppose that the latter is only used by the user to read the outcome of the nesting process as in other query languages (such as SPARQL and SQL) and does not have to
produce ``materialised views''. Therefore, the result of the graph query itself can postpone the creation of a complete ``materialised view'', which will later use the same representation of the input data by using both the id information and the application of the User Defined Functions. In particular, the former $dt$ function is used to associate both the nested vertices, $\overline{u}$ and $\overline{w}$, and the nested edge $\overline{e}$ to their grouping references. This allows to easily reconstruct the complete nested graph information by using the inverse function of $dt$, thus allowing the postponed application of the user-defined functions.
Last, the \textsc{doNest} procedure performs the association between the nested vertices (and edges) $f$ and its members within the input graph $m_i$. When the pattern requires that $m_i$ should be a member of $f$ in the final nested graph,  \textsc{doNest} stores those membership associations as pairs $\Braket{f,m_i}$ in a $Nesting$ file. By doing so, we omit the \texttt{GROUP BY} cost which affects the previously seen query languages.

%% TODO: Please note that if in $g_E$ there is no path connecting $\alpha$ to $\gamma_E^{src}$ or $\gamma_E^{dst}$, the problem may quickly become cubic with regard to the size of the vertices, because we must create all the possible permutations where $v'$ is present alongside another element matching $\gamma_E^{src}$ or $\gamma_E^{dst}$.