A best-effort approach to an infrastructure for Chinese Web related research |
Weining QIAN, Aoying ZHOU, Minqi ZHOU |
Institute of Massive Computing, Software Engineering Institute, East China Normal University, Shanghai 200241, China |
|
|
Abstract The design of the Chinese Web Infrastructure (CWI), a prototype system aimed at forum data analysis, is introduced. CWI takes a best-effort approach: 1) it tries its best to extract or annotate semantics over the web data; 2) it provides flexible schemes for users to transform the web data into eXtensible Markup Language (XML) forms with richer semantic annotations, which are friendlier to further analytical tasks; 3) a distributed graph repository, called DISGR, is used as the backend for managing the web data. The paper introduces the design issues, reports the progress of the implementation, and discusses the research issues that are under study.
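As a rough illustration of point 2), the sketch below wraps a single forum post in XML and attaches entity annotations to it. It is not CWI's actual scheme: the function name `post_to_xml`, the element and attribute names, and the sample post are all hypothetical, and the entity list is assumed to come from some upstream extractor.

```python
import xml.etree.ElementTree as ET


def post_to_xml(post, entities):
    """Wrap one forum post in XML and attach semantic annotations.

    `post` is a dict with hypothetical keys: id, author, text.
    `entities` is a list of (surface_form, entity_type) pairs that an
    upstream extractor is assumed to have produced.
    """
    root = ET.Element("post", id=str(post["id"]))
    ET.SubElement(root, "author").text = post["author"]
    ET.SubElement(root, "text").text = post["text"]
    annotations = ET.SubElement(root, "annotations")
    for surface, entity_type in entities:
        # Best effort: keep only entities whose surface form occurs in the text.
        start = post["text"].find(surface)
        if start < 0:
            continue
        entity = ET.SubElement(
            annotations,
            "entity",
            type=entity_type,
            start=str(start),
            end=str(start + len(surface)),
        )
        entity.text = surface
    return root


# Hypothetical sample data, for illustration only.
post = {"id": 1, "author": "user42", "text": "ECNU is in Shanghai."}
candidates = [
    ("ECNU", "organization"),
    ("Shanghai", "location"),
    ("Beijing", "location"),  # not in the text, so it is skipped
]
xml_root = post_to_xml(post, candidates)
xml_string = ET.tostring(xml_root, encoding="unicode")
```

Keeping character offsets alongside each annotation is one possible design choice; it lets later analytical steps map an entity back to its position in the original post text.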
|
Keywords
Chinese Web infrastructure
semantic entity
graph data model
distributed storage
|
Corresponding Author(s):
QIAN Weining, Email: wnqian@sei.ecnu.edu.cn
|
Issue Date: 05 June 2011