Frontiers of Electrical and Electronic Engineering

ISSN 2095-2732

ISSN 2095-2740(Online)

CN 10-1028/TM

Front Elect Electr Eng Chin 2011, Vol. 6, Issue 2: 388-396. https://doi.org/10.1007/s11460-011-0137-z
RESEARCH ARTICLE
A best-effort approach to an infrastructure for Chinese Web related research
Weining QIAN, Aoying ZHOU, Minqi ZHOU
Institute of Massive Computing, Software Engineering Institute, East China Normal University, Shanghai 200241, China
Abstract

This paper introduces the design of the infrastructure for the Chinese Web (CWI), a prototype system aimed at forum data analysis. CWI takes a best-effort approach: 1) it tries its best to extract or annotate semantics over the web data; 2) it provides flexible schemes for users to transform the web data into eXtensible Markup Language (XML) forms with richer semantic annotations, which are friendlier to further analytical tasks; 3) a distributed graph repository, called DISGR, is used as the backend for managing web data. The paper introduces the design issues, reports the progress of the implementation, and discusses the research issues that are under study.
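
As a rough illustration of point 2, the following Python sketch turns a forum post into XML and wraps known entity mentions in annotation elements. It is not taken from CWI: the function name `post_to_xml`, the dictionary keys `id`, `author`, and `text`, the element names `post`, `body`, and `entity`, and the simple dictionary lookup used in place of a real semantic entity detector are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def post_to_xml(post, entities):
    """Wrap a forum post in XML and annotate known entity mentions.

    `post` is a dict with hypothetical keys 'id', 'author', 'text'.
    `entities` maps a surface string to a hypothetical entity type.
    """
    root = ET.Element("post", id=str(post["id"]), author=post["author"])
    body = ET.SubElement(root, "body")
    text = post["text"]

    # Scan left to right, taking the longest entity that starts here.
    spans = []
    pos = 0
    while pos < len(text):
        best = None
        for surface, etype in entities.items():
            if surface and text.startswith(surface, pos):
                if best is None or len(surface) > len(best[0]):
                    best = (surface, etype)
        if best is not None:
            spans.append((pos, pos + len(best[0]), best[1]))
            pos += len(best[0])
        else:
            pos += 1

    # Build mixed content: plain text lives in .text and .tail.
    body.text = ""
    cursor = 0
    last = None
    for start, end, etype in spans:
        plain = text[cursor:start]
        if last is None:
            body.text += plain
        else:
            last.tail = plain
        last = ET.SubElement(body, "entity", type=etype)
        last.text = text[start:end]
        cursor = end
    rest = text[cursor:]
    if last is None:
        body.text += rest
    else:
        last.tail = rest
    return root


# Example input, invented for illustration only.
example = {"id": 1, "author": "alice",
           "text": "I bought a Nokia phone in Shanghai."}
lexicon = {"Nokia": "brand", "Shanghai": "location"}
print(ET.tostring(post_to_xml(example, lexicon), encoding="unicode"))
```

The annotated output keeps the original text recoverable (concatenating all text nodes yields the post unchanged), so later analytical tasks can use either the raw text or the annotations.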

Keywords: Chinese Web infrastructure, semantic entity, graph data model, distributed storage
Corresponding Author(s): QIAN Weining, Email: wnqian@sei.ecnu.edu.cn
Issue Date: 05 June 2011
 Cite this article:   
Weining QIAN, Aoying ZHOU, Minqi ZHOU. A best-effort approach to an infrastructure for Chinese Web related research[J]. Front Elect Electr Eng Chin, 2011, 6(2): 388-396.
 URL:  
https://academic.hep.com.cn/fee/EN/10.1007/s11460-011-0137-z
https://academic.hep.com.cn/fee/EN/Y2011/V6/I2/388