Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2024, Vol. 18 Issue (6) : 186506    https://doi.org/10.1007/s11704-023-2616-9
RESEARCH ARTICLE
IP2vec: an IP node representation model for IP geolocation
Fan ZHANG1,2, Meijuan YIN1,2(), Fenlin LIU1,2, Xiangyang LUO1,2, Shuodi ZU1,2
1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China
2. Key Laboratory of Cyberspace Situation Awareness of Henan Province, Zhengzhou 450001, China
 Download: PDF(17087 KB)   HTML
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

IP geolocation is essential for the territorial analysis of sensitive network entities, location-based services (LBS) and network fraud detection. It has important theoretical significance and application value. Measurement-based IP geolocation is a hot research topic. However, the existing IP geolocation algorithms cannot effectively utilize the distance characteristics of the delay, and the nodes’ connection relation, resulting in high geolocation error. It is challenging to obtain the mapping between delay, nodes’ connection relation, and geographical location. Based on the idea of network representation learning, we propose a representation learning model for IP nodes (IP2vec for short) and apply it to street-level IP geolocation. IP2vec model vectorizes nodes according to the connection relation and delay between nodes so that the IP vectors can reflect the distance and topological proximity between IP nodes. The steps of the street-level IP geolocation algorithm based on IP2vec model are as follows: Firstly, we measure landmarks and target IP to obtain delay and path information to construct the network topology. Secondly, we use the IP2vec model to obtain the IP vectors from the network topology. Thirdly, we train a neural network to fit the mapping relation between vectors and locations of landmarks. Finally, the vector of target IP is fed into the neural network to obtain the geographical location of target IP. The algorithm can accurately infer geographical locations of target IPs based on delay and topological proximity embedded in the IP vectors. The cross-validation experimental results on 10023 target IPs in New York, Beijing, Hong Kong, and Zhengzhou demonstrate that the proposed algorithm can achieve street-level geolocation. Compared with the existing algorithms such as Hop-Hot, IP-geolocater and SLG, the mean geolocation error of the proposed algorithm is reduced by 33%, 39%, and 51%, respectively.

Keywords IP geolocation      network measurement      node embedding     
Corresponding Author(s): Meijuan YIN   
About author:

Peng Lei and Charity Ngina Mwangi contributed equally to this work.

Just Accepted Date: 28 July 2023   Issue Date: 20 October 2023
 Cite this article:   
Fan ZHANG,Meijuan YIN,Fenlin LIU, et al. IP2vec: an IP node representation model for IP geolocation[J]. Front. Comput. Sci., 2024, 18(6): 186506.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2616-9
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I6/186506
Fig.1  Node2vec random walk strategy
Fig.2  Schematic diagram of the IP2vec model
Notations Descriptions
G Network topology, including nodes and edges
V Nodes in G (including vantage points, landmarks, target and routers)
E The edge between nodes in G
Du,v Minimum delay between u and v
Φ(u) The representation vector of node u
R The Number of nodes in the topology
M The Number of walks per node
A A set of node walking sequences
Bu The number of nodes connected to node u
πu,v The transition probability between nodes u and v
Ht,x The minimum number of hops from node t to node x
W During a walk, the node set generated in the current sequence
Tab.1  List of notations
Fig.3  Cluster-first random walk strategy
Fig.4  Node2vec random walk strategy
Fig.5  Small network topology (Delay in edge is in milliseconds)
Nodes x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x0
x1 0 100 100 120 140 130 110 110 130 150 40
x2 100 0 200 220 240 30 10 10 230 250 140
x3 100 200 0 20 40 230 210 210 30 50 140
x4 120 220 20 0 20 250 230 230 10 30 160
x5 140 240 40 20 0 270 250 250 30 10 180
x6 130 30 230 250 270 0 40 40 260 280 170
x7 110 10 210 230 250 40 0 20 240 260 150
x8 110 10 210 230 250 40 20 0 240 260 150
x9 130 230 30 10 30 260 240 240 0 40 170
x10 150 250 50 30 10 280 260 260 40 0 190
x0 40 140 140 160 180 170 150 150 170 190 0
Tab.2  Minimum delays between nodes
Node sequences
A1: x7 x2 x7 x2 x6 x2 x6 x2 x1 x3
A2: x8 x2 x1 x5 x10 x5 x1 x3 x1 x0
A3: x6 x2 x8 x2 x1 x0 x1 x3 x1 x6
A4: x8 x2 x6 x2 x8 x2 x7 x2 x6 x2
Tab.3  Example of node sequences
Node IP vector
x6 0.766 3.280 ?4.614 ?0.568 ?2.233
x7 ?0.832 3.916 ?2.837 1.185 ?0.511
x8 ?1.431 3.032 ?2.391 0.787 ?1.438
x9 ?0.263 ?1.165 0.338 ?0.266 2.578
x10 ?0.085 ?2.494 1.931 ?0.385 5.829
Tab.4  IP vectors of terminal nodes
IP vector Φ(x6) Φ(x7) Φ(x8) Φ(x9) Φ(x10)
Φ(x6) 1 0.8245 0.8294 ?0.6322 ?0.7324
Φ(x7) 0.8245 1 0.9568 ?0.4757 ?0.5525
Φ(x8) 0.8294 0.9568 1 ?0.6187 ?0.7048
Φ(x9) ?0.6322 ?0.4757 ?0.6187 1 0.9807
Φ(x10) ?0.7324 ?0.5525 ?0.7048 0.9807 1
Tab.5  The cosine similarity between IP vectors
Fig.6  Network topology containing hosts with similar structures (Delay in edge is in milliseconds)
Node IP vector
x3 3.665 ?1.332 2.385 ?2.208 0.048
x4 2.357 ?0.325 1.872 ?2.581 0.296
x6 ?3.072 3.115 ?0.369 ?1.987 3.201
x7 ?5.688 0.914 ?0.490 0.648 4.091
Tab.6  IP vectors of hosts with similar structures
IP vector Φ(x3) Φ(x4) Φ(x6) Φ(x7)
Φ(x3) 1 0.9505 ?0.4000 ?0.6777
Φ(x4) 0.9505 1 ?0.1243 ?0.5316
Φ(x6) ?0.4000 ?0.1243 1 0.7852
Φ(x7) ?0.6777 -0.5316 0.7852 1
Tab.7  The cosine similarity between IP vectors of hosts with similar structures
Fig.7  Relation between vector similarity and delay, hops between nodes
Node representation model Node2vec [20] IP2vec (Only use C) IP2vec (C with delay constraint)
Number of clusters 230 227 183
Mean distance/km 17.18 13.90 10.72
Tab.8  Mean distance of nodes within cluster with different node representation model
Fig.8  Geographical distribution of some landmarks after clustering. (a) Node2vec; (b) IP2vec (Only use C); (c) IP2vec (C with delay constraint)
Notations Descriptions
IPg IP address of landmark node g
Loc(IPg) Location of landmark g
lngg,latg Latitude and longitude of landmark node g
IPT IP address of target node T
da,p,q In the a-th measurement, the delay from vantage point p to node q
χ(q,e) The number of hops between node q and node e
Tab.9  List of notations in IP geolocation algorithm
Fig.9  The framework of the IP geolocation algorithm
Fig.10  Structure diagram of neural network
Fig.11  Geolocation schematic diagram
Fig.12  Network topology with different delay and distance conversion relations
Number of landmarks New York 2038
Beijing 2964
Hong Kong 2043
Zhengzhou 2978
Total 10023
Vantage points deployment China: Four vantage points are deployed in Beijing, Chengdu, Shanghai and Wuhan
America: Five vantage points are deployed in Washington, Silicon Valley, New York, Atlanta and Seattle
Number of measurements 8900 times
Algorithms for comparison SLG algorithm [3], IP-geolocater algorithm [17], Hop-Hot algorithm [19]
Probe protocol ICMP-PARIS [26]
Tab.10  Experimental settings
ω Geolocation median error/km
Beijing Zhengzhou Hong Kong New York
0 9.33 2.82 8.78 8.19
0.1 6.54 2.52 3.82 7.49
0.2 6.43 2.46 3.49 6.74
0.3 6.46 2.44 5.13 6.55
0.4 6.89 3.06 4.45 7.05
0.5 6.63 3.20 5.56 8.24
0.6 7.74 4.84 5.96 8.52
0.7 8.57 5.44 9.01 8.63
0.8 10.14 5.28 8.69 8.30
0.9 11.86 7.30 9.07 8.81
1.0 13.01 7.99 9.20 9.23
Tab.11  Median geolocation error of different cities and different geolocation parameter ω
Fig.13  Cumulative distribution of geolocation error. (a) Hong Kong; (b) Beijing; (c) New York; (d) Zhengzhou
ω Mean geolocation error/km
Beijing Zhengzhou Hong Kong New York
0 10.79 3.39 9.32 9.35
0.1 8.09 3.05 5.77 8.46
0.2 7.68 3.16 5.25 8.65
0.3 8.17 3.10 6.49 7.50
0.4 8.16 3.81 5.95 8.38
0.5 8.09 3.87 6.52 9.69
0.6 9.37 5.26 7.47 9.65
0.7 9.95 5.89 10.14 10.38
0.8 10.79 5.74 9.29 9.88
0.9 12.39 7.70 10.38 10.87
1.0 13.89 8.20 10.69 10.64
Tab.12  Mean geolocation error of different cities and different geolocation parameter ω
Fig.14  Median and mean geolocation error distribution under different parameter ω. (a) Hong Kong; (b) Beijing; (c) New York; (d) Zhengzhou
City Number of vantage points Number of router nodes in the topology Geolocation error/km
Median error Max error
Beijing 3 235 9.82 52.12
6 317 9.11 51.53
9 460 6.46 46.44
Zhengzhou 3 221 7.87 18.90
6 334 4.65 18.94
9 463 2.44 17.13
Hong Kong 3 230 7.03 44.08
6 329 6.38 44.64
9 473 5.13 39.40
New York 3 215 7.81 59.81
6 313 7.74 48.29
9 462 6.55 26.30
Tab.13  Comparison of geolocation performance with different numbers of vantage points
City Mean geolocation error/km
Number of landmarks with similar vectors < 5 5 < Number of landmarks with similar vectors < 10 Number of landmarks with similar vectors > 10
Beijing 9.88 7.48 5.83
Zhengzhou 6.38 3.69 2.55
Hong Kong 10.62 6.14 4.41
New York 11.95 10.34 6.97
Tab.14  Influence of the number of landmarks on geolocation error. Only those landmarks with similarity to the target IP vector greater than 0.9 is included
Fig.15  Comparison of the cumulative distribution of geolocation errors between multi-city mixed training and single-city training
Fig.16  Comparison of the cumulative distribution of geolocation errors. (a) Hong Kong; (b) Beijing; (c) New York; (d) Zhengzhou
City Geolocation algorithm Geolocation error/km
Median Mean Max
Beijing Proposed algorithm 6.43 7.68 55.23
SLG algorithm [3] 13.85 16.64 59.81
IP-geolocater algorithm [17] 9.86 13.29 55.48
Hop-Hot algorithm [19] 10.70 12.81 58.47
New York Proposed algorithm 6.55 7.50 26.30
SLG algorithm [3] 11.42 12.21 42.24
IP-geolocater algorithm [17] 8.55 9.85 40.21
Hop-Hot algorithm [19] 6.60 7.91 33.83
Hong Kong Proposed algorithm 3.49 5.25 38.99
SLG algorithm [3] 8.52 10.51 40.97
IP-geolocater algorithm [17] 7.43 8.54 39.38
Hop-Hot algorithm [19] 7.57 8.81 39.20
Zhengzhou Proposed algorithm 2.44 3.10 17.13
SLG algorithm [3] 6.76 7.38 20.04
IP-geolocater algorithm [17] 6.19 6.50 18.69
Hop-Hot algorithm [19] 3.75 4.73 20.26
Tab.15  Comparison of geolocation errors
  
  
  
  
  
1 J A, Muir Oorschot P C Van . Internet geolocation: evasion and counterevasion. ACM Computing Surveys, 2009, 42( 1): 4
2 Dan O, Parikh V, Davison B D. Improving IP geolocation using query logs. In: Proceedings of the 9th ACM International Conference on Web Search and Data Mining. 2016, 347–356
3 Y, Wang D, Burgener M, Flores A, Kuzmanovic C Huang . Towards street-level client-independent IP geolocation. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation. 2011, 365–379
4 F, Zhao R, Xu R, Li M, Zhu X Luo . Street-level geolocation based on router multilevel partitioning. IEEE Access, 2019, 7: 59237–59248
5 B, Gueye A, Ziviani M, Crovella S Fdida . Constraint-based geolocation of internet hosts. IEEE/ACM Transactions on Networking, 2006, 14( 6): 1219–1232
6 B, Wong I, Stoyanov E G Sirer . Octant: a comprehensive framework for the geolocalization of internet hosts. In: Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation. 2007, 23
7 Arif M J, Karunasekera S, Kulkarni S. GeoWeight: internet host geolocation based on a probability model for latency measurements. In: Proceedings of the 33rd Australasian Conferenc on Computer Science - Volume 102. 2010, 89–98
8 Z, Dong R D W, Perera R, Chandramouli K P Subbalakshmi . Network measurement based modeling and optimization for IP geolocation. Computer Networks, 2012, 56( 1): 85–98
9 G, Ciavarrini M S, Greco A Vecchio . Geolocation of internet hosts: accuracy limits through Cramér–Rao lower bound. Computer Networks, 2018, 135: 70–80
10 Q, Li Z, Wang D, Tan J, Song H, Wang L, Sun J Liu . GeoCAM: an IP-based geolocation service through fine-grained and stable webcam landmarks. IEEE/ACM Transactions on Networking, 2021, 29( 4): 1798–1812
11 O, Dan V, Parikh B D Davison . IP geolocation through reverse DNS. ACM Transactions on Internet Technology, 2022, 22( 1): 17
12 O, Dan V, Parikh B D Davison . IP geolocation through geographic clicks. ACM Transactions on Spatial Algorithms and Systems, 2022, 8( 1): 2
13 S, Zu X, Luo S, Liu Y, Liu F Liu . City-level IP geolocation algorithm based on PoP network topology. IEEE Access, 2018, 6: 64867–64875
14 F, Zhao X, Luo Y, Gan S, Zu Q, Cheng F Liu . IP geolocation based on identification routers and local delay distribution similarity. Concurrency and Computation: Practice and Experience, 2019, 31( 22): e4722
15 J N, Chen F L, Liu Y F, Shi X Luo . Towards IP location estimation using the nearest common router. Journal of Internet Technology, 2018, 19( 7): 2097–2110
16 S, Ding F, Zhao X Luo . A street-level IP geolocation method based on delay-distance correlation and multilayered common routers. Security and Communication Networks, 2021, 2021: 6658642
17 S, Zu X, Luo F Zhang . IP-geolocater: a more reliable IP geolocation algorithm based on router error training. Frontiers of Computer Science, 2022, 16( 1): 161504
18 H, Jiang Y, Liu J N Matthews . IP geolocation estimation using neural networks with stable landmarks. In: Proceedings of 2016 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). 2016, 170–175
19 F, Zhang F, Liu X Luo . Geolocation of covert communication entity on the internet for post-steganalysis. EURASIP Journal on Image and Video Processing, 2020, 2020( 1): 15
20 A, Grover J Leskovec . node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, 855–864
21 T, Mikolov K, Chen G, Corrado J Dean . Efficient estimation of word representations in vector space. In: Proceedings of the 1st International Conference on Learning Representations. 2013
22 F, Niu B, Recht C, Re S J Wright . HOGWILD!: a lock-free approach to parallelizing stochastic gradient descent. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. 2011, 693–701
23 E, Schubert J, Sander M, Ester H P, Kriegel X Xu . DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems, 2017, 42( 3): 19
24 R P Lippmann . Pattern classification using neural networks. IEEE Communications Magazine, 1989, 27( 11): 47–50
25 Z, Tao Y Liu . A regional network topology construction algorithm based on sampling measurement. Journal of Physics: Conference Series, 2021, 1861( 1): 012025
26 B, Augustin X, Cuvellier B, Orgogozo F, Viger T, Friedman M, Latapy C, Magnien R Teixeira . Avoiding traceroute anomalies with Paris traceroute. In: Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement. 2006, 153–158
27 X, Liu W, Yang M, Yin F, Liu C Yun . Street-level landmark mining algorithm based on radar search. Journal of Internet Technology, 2021, 22( 2): 283–295
28 R, Li Y, Liu Y, Qiao T, Ma B, Wang X Luo . Street-level landmarks acquisition based on SVM classifiers. Computers, Materials & Continua, 2019, 59( 2): 591–606
29 T, Ma F, Liu X, Luo M, Yin R Li . An algorithm of street-level landmark obtaining based on yellow pages. Journal of Internet Technology, 2019, 20( 5): 1415–1428
30 N, Spring R, Mahajan D Wetherall . Measuring ISP topologies with rocketfuel. ACM SIGCOMM Computer Communication Review, 2002, 32( 4): 133–145
31 R, Govindan H Tangmunarunkit . Heuristics for internet map discovery. In: Proceedings of IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies. 2000, 1371–1380
32 M H, Gunes K Sarac . Resolving anonymous routers in internet topology measurement studies. In: Proceedings of IEEE INFOCOM 2008 - The 27th Conference on Computer Communications. 2008, 1076–1084
33 T, Mikolov I, Sutskever K, Chen G, Corrado J Dean . Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013, 3111–3119
[1] FCS-22616-OF-FZ_suppl_1 Download
[1] Shuodi ZU, Xiangyang LUO, Fan ZHANG. IP-geolocater: a more reliable IP geolocation algorithm based on router error training[J]. Front. Comput. Sci., 2022, 16(1): 161504-.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed