GraB: A Distributed System for Querying Web-Scale Knowledge Graphs
OVERVIEW
GraB is a distributed system for querying billion-node knowledge graphs through subgraph similarity matching. GraB uses a novel technique that bounds matching scores during computation to speed up query processing. Using these bounds, low-quality answers can be pruned efficiently without evaluating all possible answers, so GraB can answer queries within a few seconds on billion-node graphs. The design of GraB is introduced in our paper "Querying Web-Scale Information Networks Through Bounding Matching Scores" (WWW 2015).
PREREQUISITES
First, download GraB (GraB-System.tar.gz). The package contains the source code of GraB and a DBLP dataset.
GraB is an MPI-based system implemented by modifying Piccolo. To build and use GraB, you will need at minimum the following:
CMake (> 2.6)
OpenMPI
Python (2.*)
gcc/g++ (> 4)
protocol buffers
If available, the following libraries will be used:
Python development headers; SWIG
TCMalloc
google-perftools
In addition, GraB bundles several support libraries that are compiled as part of the build process:
google-flags
google-logging
On debian/ubuntu, the required libraries can be acquired by running:
sudo apt-get install build-essential cmake g++ libboost-dev libboost-python-dev libboost-thread-dev liblzo2-dev libnuma-dev libopenmpi-dev libprotobuf-dev libcr-dev libibverbs-dev openmpi-bin protobuf-compiler liblapack-dev
The optional libraries can be installed via:
sudo apt-get install libgoogle-perftools-dev python-dev swig
BUILDING
To build, simply run 'make' from the top-level GraB directory. After building, the output will be available in the bin/ directory. Specifically, a successful build generates a bin/{debug,release}/examples/example-dsm binary.
RUNNING
To execute a GraB program, you will need to modify conf/mpi-cluster to point to the set of machines on which GraB will run. For example, the file might look like:
localhost slots=1
a slots=4
b slots=4
c slots=4
This configuration allows running up to 12 workers (plus 1 master process).
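The total slot count bounds the number of MPI processes (workers plus the master) that can be launched. A minimal sketch of how the slot total can be computed from such a hostfile (an illustrative helper only; GraB itself reads this file through OpenMPI):

```python
def total_slots(lines):
    """Sum the worker slots declared in an mpi-cluster style hostfile.

    Illustrative only; not part of GraB. Lines look like "hostname slots=N".
    """
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        for field in line.split()[1:]:
            if field.startswith("slots="):
                total += int(field.split("=", 1)[1])
    return total

hostfile = ["localhost slots=1", "a slots=4", "b slots=4", "c slots=4"]
print(total_slots(hostfile))  # 13: up to 12 workers + 1 master process
```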
DATA GRAPH
The data graphs are stored using an adjacency list whose format is as follows.
Node ID \t Type; Neighbor IDs;
For example
1 -1 ;2 3 ;
2 -2 ;1 ;
3 -1 ;1 ;
This data graph has three nodes: 1, 2, and 3. Node 1 has type -1 and two neighbors, 2 and 3. Node 2 has type -2 and its only neighbor is node 1. Node 3 has type -1 and its only neighbor is node 1.
Before executing GraB, the text-based data graph must be converted to a binary data graph that uses file-based hashing to store the graph.
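As a sketch of how the text-based format above can be read (a hypothetical parser for illustration only, not part of GraB, which uses its own binary representation):

```python
def parse_data_graph(lines):
    """Parse the text-based adjacency-list format into {node: (type, neighbors)}.

    Each line looks like: "<node id> \t <type> ;<neighbor ids> ;"
    Illustrative only; GraB converts this format to a binary,
    file-based-hashing representation before query processing.
    """
    graph = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        head, neighbors, _ = line.split(";")
        node_id, node_type = head.split()
        graph[int(node_id)] = (int(node_type),
                               [int(n) for n in neighbors.split()])
    return graph

g = parse_data_graph(["1\t-1 ;2 3 ;", "2\t-2 ;1 ;", "3\t-1 ;1 ;"])
print(g[1])  # (-1, [2, 3]): node 1 has type -1 and neighbors 2 and 3
```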
QUERY GRAPH
The query graphs are stored as adjacency lists in a format similar to that of the data graphs, except that query graphs contain unknown query nodes (given negative node IDs, as in the example below).
For example
1 -1 ;-1 2 ;
-1 -2 ;1 ;
2 -1 ;1 ;
This query graph has three nodes: 1, -1, and 2. Node 1 has type -1 and two neighbors, -1 and 2. Node -1 has type -2 and its neighbor is node 1. Node 2 has type -1 and its neighbor is node 1. We want to find the data nodes that match query node -1 in the data graph.
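Since the only difference from the data-graph format is the presence of unknown nodes with negative IDs, extracting the nodes to be matched is a one-pass scan. A minimal sketch (an illustrative helper, not part of GraB's API):

```python
def unknown_query_nodes(lines):
    """Return the node IDs in a query graph that are unknowns to be matched.

    In the example query graphs, unknown nodes carry negative IDs.
    Illustrative only; not part of GraB.
    """
    unknowns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        head = line.split(";")[0]         # "<node id> \t <type> "
        node_id = int(head.split()[0])
        if node_id < 0:                   # negative ID => unknown query node
            unknowns.append(node_id)
    return unknowns

q = ["1\t-1 ;-1 2 ;", "-1\t-2 ;1 ;", "2\t-1 ;1 ;"]
print(unknown_query_nodes(q))  # [-1]
```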
CREATING DATA GRAPH
Our data graphs are created by converting RDF graphs. The conversion is implemented in a separate project that performs it using Hadoop.
EXAMPLE: RUN GRAB ON UBUNTU
Environment: 4 workers + 1 master on a single machine
Step 1) Download and extract GraB
wget http://rio.ecs.umass.edu/mnilpub/download/GraB-System.tar.gz
gunzip GraB-System.tar.gz
tar -xvf GraB-System.tar
export GRAB_HOME=`pwd`/GraB-System
Step 2) Set up the build environment
On debian/ubuntu, the required libraries can be acquired by running:
sudo apt-get install build-essential cmake g++ libboost-dev libboost-python-dev libboost-thread-dev liblzo2-dev libnuma-dev libopenmpi-dev libprotobuf-dev libcr-dev libibverbs-dev openmpi-bin protobuf-compiler liblapack-dev
The optional libraries can be installed via:
sudo apt-get install libgoogle-perftools-dev python-dev swig
Step 3) Build GraB
cd $GRAB_HOME/code
make
Step 4) Configure the computing cluster
vi $GRAB_HOME/conf/mpi-cluster
cat $GRAB_HOME/conf/mpi-cluster
localhost slots=5
Step 5) Configure data placement. Replace 'localhost' with your hostname. Note that the columns are separated by '\t' (tab).
vi $GRAB_HOME/conf/dataplacement
cat $GRAB_HOME/conf/dataplacement
dblp-4 0 localhost (replace localhost with your hostname)
dblp-4 1 localhost
dblp-4 2 localhost
dblp-4 3 localhost
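Because the columns must be tab-separated, which is easy to get wrong when editing by hand, the file can also be generated programmatically. An illustrative sketch (the dataset name "dblp-4", the partition count of 4, and the hostname are taken from the example above; adjust them for your setup):

```python
def write_dataplacement(path, dataset, num_partitions, hostname):
    """Write a dataplacement file with tab-separated columns.

    Illustrative only; mirrors the hand-written example above,
    one line per partition: "<dataset>\t<partition>\t<hostname>".
    """
    with open(path, "w") as f:
        for part in range(num_partitions):
            f.write("%s\t%d\t%s\n" % (dataset, part, hostname))

write_dataplacement("dataplacement", "dblp-4", 4, "localhost")
```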
Step 6) Copy the data graph to shared memory
cp $GRAB_HOME/dataset/dblp/file-based-hashing/bin-* /dev/shm/
Step 7) Generate query graphs (skip this step if the query graphs have already been generated)
cd $GRAB_HOME/
sh query-generator.sh
Step 8) Query processing
cd $GRAB_HOME/
sh query-processing.sh
Please contact Dr. Jiahui Jin if you have any issues (jhjinwz AT gmail DOT com).