Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2023, Vol. 17 Issue (3) : 173320    https://doi.org/10.1007/s11704-022-1627-2
RESEARCH ARTICLE
APPCorp: a corpus for Android privacy policy document structure analysis
Shuang LIU1, Fan ZHANG2, Baiyang ZHAO1, Renjie GUO1, Tao CHEN3, Meishan ZHANG2
1. College of Intelligence and Computing, Tianjin University, Tianjin 300372, China
2. School of New Media and Communication, Tianjin University, Tianjin 300350, China
3. Google, Mountain View, CA 94043, USA
Abstract

With the growing popularity of mobile devices and the wide adoption of mobile apps, privacy concerns are increasingly being raised. The privacy policy is regarded as a proper medium to state legal terms, such as those of the General Data Protection Regulation (GDPR), and to bind the legal agreement between service providers and users. However, privacy policies are usually too long and vague for end users to read and understand. It is therefore important to analyze the document structure of privacy policies automatically to assist user understanding. In this work, we create a manually labelled corpus of 231 privacy policies (more than 566,000 words and 7,748 annotated paragraphs). We benchmark the corpus with three document classification models and achieve an F1-score of more than 82%.

Keywords: privacy policy, GDPR, document structure analysis, representation learning, graph neural network
Corresponding Author(s): Meishan ZHANG   
About author: Tongcan Cui and Yizhe Hou contributed equally to this work.
Just Accepted Date: 16 February 2022   Issue Date: 08 September 2022
 Cite this article:   
Shuang LIU, Fan ZHANG, Baiyang ZHAO, et al. APPCorp: a corpus for Android privacy policy document structure analysis[J]. Front. Comput. Sci., 2023, 17(3): 173320.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-1627-2
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I3/173320
Fig.1  The privacy policy excerpt examples. (a) The ZAO privacy policy excerpt in English translation; (b) The New York Times privacy policy excerpt
| Item | Count |
|---|---|
| No. Documents | 231 |
| No. Sentences | 19,708 |
| No. Words | 566,475 |
| Annotated paragraph | 7,748 |
| Annotators per document | 3 |
Tab.1  The statistics on the privacy policy corpus
| Label | Frequency | Coverage | Avg.S | Avg.W | Fleiss’ Kappa |
|---|---|---|---|---|---|
| Policy introductory | 638 | 0.69 | 2.23 | 52.92 | 0.65 |
| First party collection and use | 2,433 | 0.71 | 2.61 | 68.45 | 0.70 |
| Cookies and similar technologies | 465 | 0.48 | 3.00 | 64.51 | 0.72 |
| Third party share and collection | 1,316 | 0.68 | 2.65 | 69.90 | 0.67 |
| User right and control | 1,194 | 0.61 | 2.39 | 57.21 | 0.68 |
| Data security | 383 | 0.62 | 2.65 | 59.44 | 0.79 |
| Data retention | 211 | 0.43 | 2.26 | 62.63 | 0.72 |
| International data transfer | 198 | 0.42 | 2.39 | 64.86 | 0.70 |
| Specific audiences | 332 | 0.57 | 2.66 | 67.16 | 0.78 |
| Policy change | 246 | 0.61 | 2.80 | 54.97 | 0.76 |
| Policy contact information | 332 | 0.65 | 1.63 | 30.79 | 0.70 |
Tab.2  The per-label statistics in our corpus
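The Fleiss’ Kappa column in Tab.2 measures agreement among the three annotators per document. As a minimal sketch of the standard Fleiss' kappa computation (not necessarily the exact per-label procedure used for the corpus), the function below takes a count matrix in which each row is one annotated item and each column one category. The function name `fleiss_kappa` and the toy ratings are hypothetical and are not taken from the corpus.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Standard Fleiss' kappa.

    counts[i][j] is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number of annotators.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of annotators")

    # Observed agreement: fraction of agreeing annotator pairs per item,
    # averaged over all items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(p * p for p in proportions)

    # Undefined (division by zero) if all ratings fall in one category.
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical toy data: 4 paragraphs, 3 annotators, 3 categories.
toy_counts = [
    [3, 0, 0],  # all three annotators agree
    [2, 1, 0],  # one annotator disagrees
    [0, 3, 0],
    [0, 0, 3],
]
kappa = fleiss_kappa(toy_counts)
print(f"{kappa:.3f}")
```

For the toy matrix the observed agreement is 5/6 and the chance agreement is 25/72, so kappa works out to 35/47, roughly 0.745, which falls in the same range as the values reported in Tab.2.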
Fig.2  Input representation and model structure of HGAT. The word embedding is adopted from GloVe or BERT. (a) Input representation; (b) the HAN model structure
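| Label | SVM P | SVM R | SVM F | HAN (GloVe) P | HAN (GloVe) R | HAN (GloVe) F | HGAT (GloVe) P | HGAT (GloVe) R | HGAT (GloVe) F | HAN (BERT) P | HAN (BERT) R | HAN (BERT) F | HGAT (BERT) P | HGAT (BERT) R | HGAT (BERT) F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PI | 76.24 | 69.80 | 72.88 | 76.18 | 73.55 | 74.84 | 77.74 | 78.72 | 78.23 | 82.06 | 77.31 | 79.61 | 82.31 | 79.34 | 80.80 |
| FPCU | 75.01 | 86.98 | 80.55 | 81.02 | 82.32 | 81.66 | 83.15 | 81.58 | 82.36 | 82.55 | 86.00 | 84.24 | 82.91 | 85.02 | 83.95 |
| CT | 82.77 | 73.49 | 77.85 | 78.40 | 78.23 | 78.32 | 79.01 | 79.53 | 79.27 | 81.22 | 80.17 | 80.69 | 83.22 | 81.25 | 82.22 |
| TPSC | 78.73 | 74.83 | 76.73 | 79.56 | 77.80 | 78.67 | 77.67 | 78.26 | 77.96 | 80.57 | 80.32 | 80.44 | 79.48 | 80.93 | 80.20 |
| URC | 79.42 | 76.22 | 77.79 | 81.60 | 77.90 | 79.71 | 79.87 | 81.34 | 80.60 | 81.41 | 77.65 | 79.48 | 80.80 | 78.49 | 79.62 |
| DS | 86.29 | 72.51 | 78.81 | 77.42 | 81.68 | 79.49 | 82.32 | 81.68 | 82.00 | 82.63 | 82.20 | 82.41 | 86.11 | 81.15 | 83.56 |
| DR | 86.74 | 73.71 | 79.70 | 74.78 | 79.34 | 76.99 | 78.83 | 82.16 | 80.46 | 81.28 | 83.57 | 82.41 | 86.96 | 84.51 | 85.71 |
| IDT | 76.06 | 83.08 | 79.41 | 74.42 | 82.05 | 78.05 | 75.91 | 85.64 | 80.48 | 74.07 | 82.05 | 77.86 | 74.57 | 88.72 | 81.03 |
| SA | 92.45 | 73.57 | 81.94 | 79.83 | 83.18 | 81.47 | 83.58 | 84.08 | 83.83 | 86.08 | 81.68 | 83.82 | 88.12 | 84.68 | 86.37 |
| PC | 91.60 | 88.98 | 90.27 | 90.72 | 87.76 | 89.21 | 94.64 | 86.53 | 90.41 | 93.28 | 90.61 | 91.93 | 95.67 | 90.20 | 92.86 |
| PCI | 82.37 | 77.18 | 79.69 | 79.41 | 81.08 | 80.24 | 83.02 | 80.78 | 81.89 | 78.92 | 78.68 | 78.80 | 81.08 | 81.08 | 81.08 |
| Micro | 78.94 | 78.94 | 78.94 | 79.94 | 79.94 | 79.94 | 80.98 | 80.98 | 80.98 | 81.98 | 81.98 | 81.98 | 82.50 | 82.50 | 82.50 |
| Macro | 82.52 | 77.30 | 79.60 | 79.39 | 80.44 | 79.88 | 81.43 | 81.85 | 81.59 | 82.19 | 81.84 | 81.97 | 83.75 | 83.22 | 83.40 |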
Tab.3  The Precision/Recall/F1 score of classification models
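The Micro and Macro rows of Tab.3 can be related to the per-label rows with a short sketch. The first part recomputes the macro averages for the SVM columns as unweighted means of the eleven per-label scores copied from Tab.3. The second part uses hypothetical toy predictions to illustrate why micro precision, recall and F1 coincide when every paragraph receives exactly one label, which is an assumption here, although it is consistent with the identical values in the Micro row. The helper names `macro_average` and `micro_prf` are illustrative only.

```python
# Per-label (P, R, F) for the SVM columns, copied from Tab.3.
svm_scores = {
    "PI":   (76.24, 69.80, 72.88),
    "FPCU": (75.01, 86.98, 80.55),
    "CT":   (82.77, 73.49, 77.85),
    "TPSC": (78.73, 74.83, 76.73),
    "URC":  (79.42, 76.22, 77.79),
    "DS":   (86.29, 72.51, 78.81),
    "DR":   (86.74, 73.71, 79.70),
    "IDT":  (76.06, 83.08, 79.41),
    "SA":   (92.45, 73.57, 81.94),
    "PC":   (91.60, 88.98, 90.27),
    "PCI":  (82.37, 77.18, 79.69),
}


def macro_average(scores):
    """Unweighted mean of per-label P, R and F, rounded to two decimals."""
    n = len(scores)
    return tuple(
        round(sum(values[k] for values in scores.values()) / n, 2)
        for k in range(3)
    )


def micro_prf(gold, pred):
    """Micro-averaged P, R, F1 for single-label predictions."""
    labels = set(gold) | set(pred)
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == p)
    # A wrong prediction is a false positive for the predicted label
    # and a false negative for the gold label.
    fp = sum(1 for lab in labels for g, p in pairs if p == lab and g != lab)
    fn = sum(1 for lab in labels for g, p in pairs if g == lab and p != lab)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


macro = macro_average(svm_scores)
print(macro)  # matches the Macro row for SVM: 82.52, 77.30, 79.60

# Hypothetical toy predictions, not taken from the corpus.
gold = ["DS", "DR", "PC", "DS"]
pred = ["DS", "PC", "PC", "DR"]
micro = micro_prf(gold, pred)
print(micro)  # all three values equal the accuracy (2 of 4 correct)
```

The recomputed macro F (79.60) is the mean of the per-label F scores; taking the harmonic mean of the macro P and macro R instead would give roughly 79.8, so the Macro row appears to average per-label F1 directly.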
Fig.3  F1-score against categories
Fig.4  F1-score against paragraph length
Fig.5  Visualization of sentence attention for an example from the test dataset. The models based on GloVe always give higher attention to the first sentence, while the models based on BERT give higher attention to the more relevant sentences. (a) An example with the label IDT; (b) visualization of sentence attention
Fig.6  Visualization of word attention for an example with the label User Right and Control (URC). The models with BERT give relatively uniform attention, and the two HGAT models pay more attention to the most relevant words according to the syntactic knowledge, especially the root word “control”. (a) The dependency tree of the only input sentence; (b) visualization of word attention