APPCorp: a corpus for Android privacy policy document structure analysis

Shuang LIU1, Fan ZHANG2, Baiyang ZHAO1, Renjie GUO1, Tao CHEN3, Meishan ZHANG2
1. College of Intelligence and Computing, Tianjin University, Tianjin 300372, China
2. School of New Media and Communication, Tianjin University, Tianjin 300350, China
3. Google, Mountain View, CA 94043, USA
Abstract With the growing popularity of mobile devices and the wide adoption of mobile apps, concern about privacy is increasing. The privacy policy is recognized as a proper medium for presenting legal terms, such as those of the General Data Protection Regulation (GDPR), and for establishing a binding legal agreement between service providers and users. However, privacy policies are usually too long and vague for end users to read and understand. It is thus important to analyze the document structure of privacy policies automatically to help users understand them. In this work, we create a manually labelled corpus of 231 privacy policies (more than 566,000 words and 7,748 annotated paragraphs). We benchmark the corpus with 3 document classification models and achieve an F1-score of more than 82%.

Keywords
privacy policy
GDPR
document structure analysis
representation learning
graph neural network

Corresponding Author(s):
Meishan ZHANG

About author: Tongcan Cui and Yizhe Hou contributed equally to this work.
Just Accepted Date: 16 February 2022
Issue Date: 08 September 2022