|
Big graph search: challenges and techniques
Shuai MA,Jia LI,Chunming HU,Xuelian LIN,Jinpeng HUAI
Front. Comput. Sci.. 2016, 10 (3): 387-398.
https://doi.org/10.1007/s11704-015-4515-1
On one hand, compared with traditional relational and XML models, graphs have more expressive power and are widely used today. On the other hand, various applications of social computing trigger the pressing need of a new search paradigm. In this article, we argue that big graph search is the one filling this gap. We first introduce the application of graph search in various scenarios. We then formalize the graph search problem, and give an analysis of graph search from an evolutionary point of view, followed by the evidences from both the industry and academia. After that, we analyze the difficulties and challenges of big graph search. Finally, we present three classes of techniques towards big graph search: query techniques, data techniques and distributed computing techniques.
References |
Related Articles |
Metrics
|
|
String similarity search and join: a survey
Minghe YU,Guoliang LI,Dong DENG,Jianhua FENG
Front. Comput. Sci.. 2016, 10 (3): 399-417.
https://doi.org/10.1007/s11704-015-5900-5
String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the recent decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.
References |
Supplementary Material |
Related Articles |
Metrics
|
|
Event analysis in social multimedia: a survey
Xueliang LIU,Meng WANG,Benoit HUET
Front. Comput. Sci.. 2016, 10 (3): 433-446.
https://doi.org/10.1007/s11704-015-4583-2
Recent years have witnessed the rapid growth of social multimedia data available over the Internet. The age of huge amount of media collection provides users facilities to share and access data, while it also demands the revolution of data management techniques, since the exponential growth of social multimedia requires more scalable, effective and robust technologies to manage and index them. The event is one of the most important cues to recall people’s past memory. The reminder value of an event makes it extremely helpful in organizing data. The study of event based analysis on social multimedia data has drawn intensive attention in research community. In this article, we provide a comprehensive survey on event based analysis over social multimedia data, including event enrichment, detection, and categorization. We introduce each paradigm and summarize related research efforts. In addition, we also suggest the emerging trends in this research area.
References |
Supplementary Material |
Related Articles |
Metrics
|
|
A survey of cloud resource management for complex engineering applications
Haibao CHEN,Song WU,Hai JIN,Wenguang CHEN,Jidong ZHAI,Yingwei LUO,Xiaolin WANG
Front. Comput. Sci.. 2016, 10 (3): 447-461.
https://doi.org/10.1007/s11704-015-4207-x
Traditionally, complex engineering applications (CEAs), which consist of numerous components (software) and require a large amount of computing resources, usually run in dedicated clusters or high performance computing (HPC) centers. Nowadays, Cloud computing system with the ability of providing massive computing resources and customizable execution environment is becoming an attractive option for CEAs. As a new type on Cloud applications, CEA also brings the challenges of dealing with Cloud resources. In this paper, we provide a comprehensive survey of Cloud resource management research for CEAs. The survey puts forward two important questions: 1) what are the main challenges for CEAs to run in Clouds? and 2) what are the prior research topics addressing these challenges? We summarize and highlight the main challenges and prior research topics. Our work can be probably helpful to those scientists and engineers who are interested in running CEAs in Cloud environment.
References |
Supplementary Material |
Related Articles |
Metrics
|
|
iGraph: an incremental data processing system for dynamic graph
Wuyang JU,Jianxin LI,Weiren YU,Richong ZHANG
Front. Comput. Sci.. 2016, 10 (3): 462-476.
https://doi.org/10.1007/s11704-016-5485-7
With the popularity of social network, the demand for real-time processing of graph data is increasing. However, most of the existing graph systems adopt a batch processing mode, therefore the overhead of maintaining and processing of dynamic graph is significantly high. In this paper, we design iGraph, an incremental graph processing system for dynamic graph with its continuous updates. The contributions of iGraph include: 1) a hash-based graph partition strategy to enable fine-grained graph updates; 2) a vertexbased graph computing model to support incremental data processing; 3) detection and rebalance methods of hotspot to address the workload imbalance problem during incremental processing. Through the general-purpose API, iGraph can be used to implement various graph processing algorithms such as PageRank. We have implemented iGraph on Apache Spark, and experimental results show that for real life datasets, iGraph outperforms the original GraphX in respect of graph update and graph computation.
References |
Supplementary Material |
Related Articles |
Metrics
|
|
Understanding information interactions in diffusion: an evolutionary game-theoretic perspective
Yuan SU,Xi ZHANG,Lixin LIU,Shouyou SONG,Binxing FANG
Front. Comput. Sci.. 2016, 10 (3): 518-531.
https://doi.org/10.1007/s11704-015-5008-y
Social networks are fundamental mediums for diffusion of information and contagions appear at some node of the network and get propagated over the edges. Prior researches mainly focus on each contagion spreading independently, regardless of multiple contagions’ interactions as they propagate at the same time. In the real world, simultaneous news and events usually have to compete for user’s attention to get propagated. In some other cases, they can cooperate with each other and achieve more influences. In this paper, an evolutionary game theoretic framework is proposed to model the interactions among multiple contagions. The basic idea is that different contagions in social networks are similar to the multiple organisms in a population, and the diffusion process is as organisms interact and then evolve from one state to another. This framework statistically learns the payoffs as contagions interacting with each other and builds the payoff matrix. Since learning payoffs for all pairs of contagions IS almost impossible (quadratic in the number of contagions), a contagion clustering method is proposed in order to decrease the number of parameters to fit, which makes our approach efficient and scalable. To verify the proposed framework, we conduct experiments by using real-world information spreading dataset of Digg. Experimental results show that the proposed game theoretic framework helps to comprehend the information diffusion process better and can predict users’ forwarding behaviors with more accuracy than the previous studies. The analyses of evolution dynamics of contagions and evolutionarily stable strategy reveal whether a contagion can be promoted or suppressed by others in the diffusion process.
References |
Supplementary Material |
Related Articles |
Metrics
|
|
FlexPoll: adaptive event polling for network-intensive applications
Xingbo WU,Xiang LONG,Lei WANG
Front. Comput. Sci.. 2016, 10 (3): 532-542.
https://doi.org/10.1007/s11704-016-3453-x
In today’s data centers supporting Internet-scale computing and input/output (I/O) services, increasinglymore network-intensive applications are deployed on the network as a service. To this end, it is critical for the applications to quickly retrieve requests from the network and send their responses to the network. To facilitate this network function, operating system usually provides an event notification mechanism so that the applications (or the library) know if the network is ready to supply data for them to read or to receive data for them to write. As a widely used and representative notification mechanism, epoll in Linux provides a scalable and high-performance implementation by allowing applications to specifically indicate which connections and what events on them need to be watched. As epoll has been used in some major systems, including key-value (KV) systems, such as Redis and Memcached, and web server systems such as NGINX, we have identified a substantial performance issue in its use. For the sake of efficiency, applications usually use epoll’s system calls to inform the kernel exactly of what events they are interested in and always keep the information up-to-date. However, in a system with demanding network traffic, such a rigid maintenance of the information is not necessary and the excess number of system calls for this purpose can substantially degrade the system’s performance. In this paper, we use Redis as an example to explore the issue. We propose a strategy of informing the kernel of the interest events in a manner adaptive to the current network load, so that the epoll system calls can be reduced and the events can be efficiently delivered. We have implemented an event-polling library, named as FlexPoll, purely in user-level without modifying any kernel code. Our evaluation on Redis shows that the query throughput can be improved by up to 46.9% on micro-benchmarks, and even up to 67.8% on workloads emulating real-world access patterns. FlexPoll is a generic mechanism thus it can be adopted by other applications in a straightforward manner, such as NGINX and Memcached.
References |
Supplementary Material |
Related Articles |
Metrics
|
|
Joint study on VMs deployment, assignment and migration in geographically distributed data centers
Chuang LIN,Min YAO,Yin LI
Front. Comput. Sci.. 2016, 10 (3): 559-573.
https://doi.org/10.1007/s11704-015-5056-3
Enterprises build private clouds to provide IT resources for geographically distributed subsidiaries or product divisions. Public cloud providers like Amazon lease their platforms to enterprise users, thus, enterprises can also rent a number of virtual machines (VMs) from their data centers in the service provider networks. Unfortunately, the network cannot always guarantee stable connectivity for their clients to access the VMs or low-latency transfer among data centers. Usually, both latency and bandwidth are in unstable network environment. Being affected by background traffics, the network status can be volatile. To reduce the latency uncertainty of client accesses, enterprises should consider the network status when they deploy data centers or rent virtual data centers from cloud providers. In this paper, we first develop a data center deployment and assignment scheme for an enterprise to meet its users’ requirements under uncertain network status. To accommodate to the changes of the network status and users’ demands, a VMs migration-based redeployment scheme is adopted. These two schemes work in a joint way, and lay out a framework to help enterprises make better use of private or public clouds.
References |
Related Articles |
Metrics
|
|
Coordinating workload balancing and power switching in renewable energy powered data center
Xian LI,Rui WANG,Zhongzhi LUAN,Yi LIU,Depei QIAN
Front. Comput. Sci.. 2016, 10 (3): 574-587.
https://doi.org/10.1007/s11704-015-5018-9
There has been growing concern about energy consumption and environmental impact of datacenters. Some pioneers begin to power datacenters with renewable energy to offset carbon footprint. However, it is challenging to integrate intermittent renewable energy into datacenter power system. Grid-tied system is widely deployed in renewable energy powered datacenters. But the drawbacks (e.g. Harmonic disturbance and costliness) of grid tie inverter harass this design. Besides, the mixture of green load and brown load makes power management heavily depend on software measurement and monitoring, which often suffers inaccuracy. We propose DualPower, a novel power provisioning architecture that enables green datacenters to integrate renewable power supply without grid tie inverters. To optimize DualPower operation, we propose a specially designed power management framework to coordinate workload balancing with power supply switching. We evaluate three optimization schemes (LM, PS and JO) under different datacenter operation scenarios on our trace-driven simulation platform. The experimental results show that DualPower can be as efficient as grid-tied system and has good scalability. In contrast to previous works, DualPower integrates renewable power at lower cost and maintains full availability of datacenter servers.
References |
Supplementary Material |
Related Articles |
Metrics
|
13 articles
|