APPCorp: a corpus for Android privacy policy document structure analysis

doi:10.1007/s11704-022-1627-2

Front. Comput. Sci. 2023, Vol. 17, Issue (3): 173320

RESEARCH ARTICLE


Shuang LIU¹, Fan ZHANG², Baiyang ZHAO¹, Renjie GUO¹, Tao CHEN³, Meishan ZHANG²

¹. College of Intelligence and Computing, Tianjin University, Tianjin 300372, China
². School of New Media and Communication, Tianjin University, Tianjin 300350, China
³. Google, Mountain View, CA 94043, USA


Abstract

With the growing popularity of mobile devices and the wide adoption of mobile apps, concern about privacy is increasing. A privacy policy is recognized as a proper medium for stating legal terms, such as those of the General Data Protection Regulation (GDPR), and for establishing a legal agreement between service providers and users. However, privacy policies are usually long and vague, which makes them hard for end users to read and understand. It is therefore important to analyze the document structure of privacy policies automatically to assist user understanding. In this work, we create a manually labelled corpus of 231 privacy policies (more than 566,000 words and 7,748 annotated paragraphs). We benchmark the corpus with three document classification models and achieve an F1-score of more than 82%.

Keywords: privacy policy; GDPR; document structure analysis; representation learning; graph neural network

Corresponding Author(s): Meishan ZHANG

About author: Tongcan Cui and Yizhe Hou contributed equally to this work.

Just Accepted Date: 16 February 2022 Issue Date: 08 September 2022

Cite this article:

Shuang LIU, Fan ZHANG, Baiyang ZHAO, et al. APPCorp: a corpus for Android privacy policy document structure analysis[J]. Front. Comput. Sci., 2023, 17(3): 173320.

URL:

https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-1627-2
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I3/173320

Fig.1 Privacy policy excerpt examples. (a) The ZAO privacy policy excerpt in English translation; (b) The New York Times

Tab.1 The statistics on the privacy policy corpus

Label	Frequency	Coverage	Avg.S	Avg.W	Fleiss’ Kappa
Policy introductory	638	0.69	2.23	52.92	0.65
First party collection and use	2,433	0.71	2.61	68.45	0.70
Cookies and similar technologies	465	0.48	3.00	64.51	0.72
Third party share and collection	1,316	0.68	2.65	69.90	0.67
User right and control	1,194	0.61	2.39	57.21	0.68
Data security	383	0.62	2.65	59.44	0.79
Data retention	211	0.43	2.26	62.63	0.72
International data transfer	198	0.42	2.39	64.86	0.70
Specific audiences	332	0.57	2.66	67.16	0.78
Policy change	246	0.61	2.80	54.97	0.76
Policy contact information	332	0.65	1.63	30.79	0.70

Tab.2 The per-label statistics in our corpus
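The Fleiss’ Kappa column above reports inter-annotator agreement. As a minimal sketch of how such a value can be computed (following the standard definition of Fleiss' kappa), the function below takes a matrix of rating counts. The function name, the toy counts, and the number of raters are illustrative assumptions; this page does not state how many annotators labelled each paragraph or how the per-label values were derived.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a ratings matrix.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    # Undefined (division by zero) if every rating falls in one category.
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: 2 paragraphs, 3 raters, 2 categories
# (for example "has label" vs. "does not have label").
perfect = [[3, 0], [0, 3]]
mixed = [[2, 1], [1, 2]]
kappa_perfect = fleiss_kappa(perfect)
kappa_mixed = fleiss_kappa(mixed)
```

With these toy counts, full agreement yields a kappa of 1, while the mixed ratings fall below chance agreement and yield a negative value.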

Fig.2 Input representation and model structure of HGAT. The word embedding is adopted from GloVe or BERT. (a) Input representation; (b) the HAN model structure

Label	SVM			HAN (GloVe)			HGAT (GloVe)			HAN (BERT)			HGAT (BERT)
	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F
PI	76.24	69.80	72.88	76.18	73.55	74.84	77.74	78.72	78.23	82.06	77.31	79.61	82.31	79.34	80.80
FPCU	75.01	86.98	80.55	81.02	82.32	81.66	83.15	81.58	82.36	82.55	86.00	84.24	82.91	85.02	83.95
CT	82.77	73.49	77.85	78.40	78.23	78.32	79.01	79.53	79.27	81.22	80.17	80.69	83.22	81.25	82.22
TPSC	78.73	74.83	76.73	79.56	77.80	78.67	77.67	78.26	77.96	80.57	80.32	80.44	79.48	80.93	80.20
URC	79.42	76.22	77.79	81.60	77.90	79.71	79.87	81.34	80.60	81.41	77.65	79.48	80.80	78.49	79.62
DS	86.29	72.51	78.81	77.42	81.68	79.49	82.32	81.68	82.00	82.63	82.20	82.41	86.11	81.15	83.56
DR	86.74	73.71	79.70	74.78	79.34	76.99	78.83	82.16	80.46	81.28	83.57	82.41	86.96	84.51	85.71
IDT	76.06	83.08	79.41	74.42	82.05	78.05	75.91	85.64	80.48	74.07	82.05	77.86	74.57	88.72	81.03
SA	92.45	73.57	81.94	79.83	83.18	81.47	83.58	84.08	83.83	86.08	81.68	83.82	88.12	84.68	86.37
PC	91.60	88.98	90.27	90.72	87.76	89.21	94.64	86.53	90.41	93.28	90.61	91.93	95.67	90.20	92.86
PCI	82.37	77.18	79.69	79.41	81.08	80.24	83.02	80.78	81.89	78.92	78.68	78.80	81.08	81.08	81.08
Micro	78.94	78.94	78.94	79.94	79.94	79.94	80.98	80.98	80.98	81.98	81.98	81.98	82.50	82.50	82.50
Macro	82.52	77.30	79.60	79.39	80.44	79.88	81.43	81.85	81.59	82.19	81.84	81.97	83.75	83.22	83.40

Tab.3 The Precision/Recall/F1 score of classification models
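To make the relationship between the columns of Tab.3 concrete, here is a small sketch that recomputes F from P and R and recomputes the Macro row for the SVM columns. The helper names are illustrative, not taken from the paper. The per-label numbers are copied from the table, and the check suggests that the Macro F value is consistent with averaging the per-label F scores rather than taking the F1 of Macro P and Macro R.

```python
from statistics import mean
from typing import Dict, Tuple

Scores = Dict[str, Tuple[float, float, float]]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (any consistent scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(scores: Scores, index: int) -> float:
    """Unweighted mean over labels of one metric (0 = P, 1 = R, 2 = F)."""
    return mean(row[index] for row in scores.values())


# SVM columns of Tab.3, in percent: label -> (P, R, F).
SVM_SCORES: Scores = {
    "PI": (76.24, 69.80, 72.88),
    "FPCU": (75.01, 86.98, 80.55),
    "CT": (82.77, 73.49, 77.85),
    "TPSC": (78.73, 74.83, 76.73),
    "URC": (79.42, 76.22, 77.79),
    "DS": (86.29, 72.51, 78.81),
    "DR": (86.74, 73.71, 79.70),
    "IDT": (76.06, 83.08, 79.41),
    "SA": (92.45, 73.57, 81.94),
    "PC": (91.60, 88.98, 90.27),
    "PCI": (82.37, 77.18, 79.69),
}

# Per-label check: F recomputed from the reported P and R for FPCU.
fpcu_f1 = f1_score(*SVM_SCORES["FPCU"][:2])

# Macro row: plain averages of the per-label columns.
macro_p = macro_average(SVM_SCORES, 0)
macro_f = macro_average(SVM_SCORES, 2)

print(f"FPCU F1 = {fpcu_f1:.2f}, Macro P = {macro_p:.2f}, Macro F = {macro_f:.2f}")
```

Up to rounding, the recomputed values agree with the FPCU and Macro entries reported for SVM. In the Micro row, P, R, and F coincide, as expected when each paragraph receives exactly one predicted label.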

Fig.3 F1-score against categories

Fig.4 F1-score against paragraph length

Fig.5 Visualization of sentence attention for an example from the test dataset. The models based on GloVe always give higher attention to the first sentence, while the models based on BERT give higher attention to the more relevant sentences. (a) An example with the label IDT; (b) visualization of sentence attention

Fig.6 Visualization of word attention for an example with the label User Right and Control (URC). The models with BERT give relatively uniform attention, and the two HGAT models pay more attention to the words that are most relevant according to the syntactic knowledge, especially the root word “control”. (a) The dependency tree of the only input sentence; (b) visualization of word attention

