
Abstract
Due to its good expressivity, graph has been widely used to model the relationship among data elements. As a result, many large-scale graph processing systems have been proposed and deployed in the real world, and have achieved great successes. The optimization for these systems is also a hot research topic in both the academic and industry.
However, the current implementation of graph processing systems are typically based on certain simplification assumptions, such as“vertex is indivisible”and“each compute operation can be executed isolatedly”, and hence cannot achieve the best performance that the hardware can deliver. To resolve this problem, we investigate the field and found that the bottleneck of most graph applications is the speed of loading data. Based on this observation, we partition the load path of graph data into four stages and propose optimizations for them respectively. The main innovations of this book are as follows.
(1)We design a novel 3D graph partition algorithm. This algorithm is based on the fact that the vertex property of many graph applications is actually divisible. Through exploring this novel dimension that have never been considered by existing methods, our algorithm can reduce up to 90. 6%of the original communication cost.
(2)We define a layered graph organization format. This format enables the processing system to load more vertices at a time, and hence reduce the average loading times of each vertex. As a result, our method can decrease total disk I/O for out-of-the-core graph processing systems and leads to up to a 6. 4 times speedup.
(3)We propose an automatic optimization algorithm for matrix execution engine. This algorithm is mainly based on loop fusion and has also considered the consistency requirement of a distributed environment. It is able to assure the correctness, and simultaneously, achieve a speedup up to 5. 8 times.
(4)We implement a process-in-memory-oriented graph processing framework. By enforcing certain constraints in the programming model, our framework makes it possible to automatically remove redundancy in the communication. It also provides a broadcast-based optimization that reduce communication load on bottleneck links. According to our calculation, it can reduce the communication cost to as low as only 5% of the original.
Key words: graph computing; distributed computing; matrix computing; locality; PIM