Frontiers of Earth Science

ISSN 2095-0195

ISSN 2095-0209(Online)

CN 11-5982/P

Postal Subscription Code 80-963

2018 Impact Factor: 1.205

Front. Earth Sci.    2019, Vol. 13 Issue (1) : 180-190    https://doi.org/10.1007/s11707-018-0704-1
RESEARCH ARTICLE
Unsupervised learning on scientific ocean drilling datasets from the South China Sea
Kevin C. TSE¹, Hon-Chim CHIU², Man-Yin TSANG³, Yiliang LI¹, Edmund Y. LAM⁴
1. Department of Earth Sciences, The University of Hong Kong, Pokfulam, Hong Kong, China
2. Department of Geography and Centre for Geo-computation Studies, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
3. Department of Earth Sciences, University of Toronto, Toronto, ON M5S 2M8, Canada
4. Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam, Hong Kong, China
Abstract

Unsupervised learning methods were applied to explore data patterns in multivariate geophysical datasets collected from ocean-floor sediment cores recovered by scientific ocean drilling in the South China Sea. Earlier studies on similar datasets used supervised learning methods, which make predictions from labeled training samples; unsupervised learning methods, by contrast, require no a priori information and operate on the input data alone. In this study, popular unsupervised learning methods, including K-means, self-organizing maps, hierarchical clustering, and random forest, were coupled with different distance metrics to form exploratory data clusters. The resulting clusters were externally validated against the lithologic units and geologic time scales assigned to the datasets by conventional methods. Compact and connected data clusters displayed varying degrees of correspondence with these existing classifications. K-means and self-organizing maps corresponded better with lithologic units, while random forest corresponded best with geologic time scales. This study provides an early example of how unsupervised machine learning methods can serve as an automatic processing tool for the increasingly high volume of scientific ocean drilling data.
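
As a rough illustration of the kind of clustering the abstract describes, the sketch below implements plain K-means (Lloyd's algorithm) with squared Euclidean distance in Python. It is not the authors' code: the function name `kmeans`, the random seeding scheme, and the iteration cap are assumptions made for this example, and the study's other methods and distance metrics are not covered.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm; returns (labels, centroids).

    points: list of equal-length numeric tuples (one tuple per sample).
    Hypothetical helper for illustration, not taken from the paper.
    """
    rng = random.Random(seed)
    # Assumption: initialise centroids from k distinct samples.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iters):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy usage: two well-separated groups of three samples each.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cluster_labels, cluster_centroids = kmeans(samples, k=2)
```

In a workflow like the one described, the number of clusters `k` would be set to the number of lithologic units or geologic time intervals before the labels are compared with the reference classification.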

Keywords: machine learning, unsupervised learning, ODP, IODP, clustering
Corresponding Author(s): Kevin C. TSE   
Just Accepted Date: 24 April 2018   Online First Date: 01 June 2018    Issue Date: 25 January 2019
 Cite this article:   
Kevin C. TSE, Hon-Chim CHIU, Man-Yin TSANG, et al. Unsupervised learning on scientific ocean drilling datasets from the South China Sea[J]. Front. Earth Sci., 2019, 13(1): 180-190.
 URL:  
https://academic.hep.com.cn/fesci/EN/10.1007/s11707-018-0704-1
https://academic.hep.com.cn/fesci/EN/Y2019/V13/I1/180
Fig.1  Map showing locations of the four ocean drill sites used in this study.
| Date | Leg | Site | Latitude | Longitude | Drilled depth /m | Water depth /m | Core recovery | Age at base /Ma | Ref. |
|---|---|---|---|---|---|---|---|---|---|
| Feb–Apr 2000 | 146 | 1146 | 19°27.4′N | 116°16.37′E | 603.5 | 2091.7 | 95% | 15 | Moore et al., 2001; Wang and Li, 2009 |
| Feb–Apr 2000 | 146 | 1148 | 18°50.2′N | 116°33.93′E | 853.2 | 3291.8 | 98% | 13 | Moore et al., 2001; Wang and Li, 2009 |
| Jan–Mar 2014 | 349 | U1431 | 15°22.5′N | 117°00.00′E | 617.0 | 4240.5 | 100% | 16 | Li et al., 2014; Wang and Li, 2009 |
| Jan–Mar 2014 | 349 | U1433 | 12°55.1′N | 115°02.85′E | 858.5 | 4379.3 | 96% | 26 | Li et al., 2014; Wang and Li, 2009 |

Tab.1  Summary of the datasets of the four ODP/IODP sites
| Dataset | Extracted Variables | Sampling Frequency |
|---|---|---|
| MAD (Moisture and Density) | Water content (bulk), water content (dry), bulk density (g/cc), dry density (g/cc), grain density (g/cc), porosity (%) | 1.5 m |
| RSC (Reflectance Spectrophotometry and Colorimetry) | Reflectance values at 400, 450, 500, 550, 600, 650, 700 nm in % intensity. (For IODP sites, L*, a*, b*, and tristimulus (X, Y, Z) values are used instead) | 4 cm |
| MSL (Magnetic Susceptibility) | Drift-corrected suscept. (inst. units) | 5 cm |
| NGR (Natural Gamma Radiation) | Bkg-corrected counts (cps) | 5 cm |
| GRA (Gamma Ray Attenuation) | Density (g/cc) | 5 cm |

Tab.2  Summary of input datasets in this study
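| Class | v1 | v2 | ... | vC | Sums |
|---|---|---|---|---|---|
| u1 | n11 | n12 | ... | n1C | n1· |
| u2 | n21 | n22 | ... | n2C | n2· |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| uR | nR1 | nR2 | ... | nRC | nR· |
| Sums | n·1 | n·2 | ... | n·C | n·· = n |

Tab.3  Notation for comparing two partitions (Hubert and Arabie, 1985), also referred to as a contingency table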
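
The contingency-table notation of Table 3 is what the Rand index (Rand, 1971) and its chance-adjusted form (Hubert and Arabie, 1985) are computed from. The Python sketch below works through that calculation from two label vectors. It assumes that the RI1 and RI2 values reported in the later tables correspond to the Rand index and the adjusted Rand index respectively, which this page does not state explicitly; the function name `rand_indices` is hypothetical.

```python
from collections import Counter
from math import comb


def rand_indices(labels_u, labels_v):
    """Return (Rand index, adjusted Rand index) for two partitions.

    labels_u, labels_v: equal-length sequences giving the class of each
    sample under partitions U and V (rows and columns of Table 3).
    """
    n = len(labels_u)
    if n != len(labels_v) or n < 2:
        raise ValueError("need two label sequences of equal length >= 2")

    n_ij = Counter(zip(labels_u, labels_v))  # cell counts n_ij
    n_i = Counter(labels_u)                  # row sums n_i.
    n_j = Counter(labels_v)                  # column sums n_.j

    # Numbers of sample pairs grouped together in cells, rows, columns.
    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_i = sum(comb(c, 2) for c in n_i.values())
    sum_j = sum(comb(c, 2) for c in n_j.values())
    total = comb(n, 2)

    # Rand index: fraction of pairs on which the two partitions agree.
    rand = (total + 2 * sum_ij - sum_i - sum_j) / total

    # Adjusted Rand index: (index - expected) / (maximum - expected).
    expected = sum_i * sum_j / total
    maximum = (sum_i + sum_j) / 2
    denom = maximum - expected
    adjusted = 1.0 if denom == 0 else (sum_ij - expected) / denom
    return rand, adjusted


# Identical partitions with swapped label names agree perfectly.
print(rand_indices([0, 0, 1, 1], [1, 1, 0, 0]))  # (1.0, 1.0)
```

The adjusted form can fall below zero when agreement is worse than expected by chance, which is consistent with the single negative RI2 entry in Table 7.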
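| Method | Index | 1146 d_euc | 1146 d_man | 1146 d_che | 1148 d_euc | 1148 d_man | 1148 d_che | U1431 d_euc | U1431 d_man | U1431 d_che | U1433 d_euc | U1433 d_man | U1433 d_che |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K-means | RI1 | 0.817 | 0.832 | 0.782 | 0.859 | 0.837 | 0.860 | 0.820 | 0.824 | 0.800 | 0.636 | 0.696 | 0.634 |
| K-means | RI2 | 0.551 | 0.584 | 0.447 | 0.455 | 0.382 | 0.452 | 0.300 | 0.298 | 0.192 | 0.214 | 0.274 | 0.213 |
| H.C. | RI1 | 0.727 | 0.832 | 0.340 | 0.686 | 0.683 | 0.649 | 0.312 | 0.410 | 0.419 | 0.369 | 0.369 | 0.369 |
| H.C. | RI2 | 0.464 | 0.583 | 0.027 | 0.234 | 0.222 | 0.272 | 0.046 | 0.072 | 0.049 | 0.071 | 0.071 | 0.071 |
| SOMs | RI1 | 0.817 | 0.832 | 0.719 | 0.868 | 0.798 | 0.869 | 0.815 | 0.817 | 0.776 | 0.635 | 0.697 | 0.633 |
| SOMs | RI2 | 0.551 | 0.584 | 0.339 | 0.497 | 0.306 | 0.503 | 0.312 | 0.307 | 0.194 | 0.211 | 0.282 | 0.215 |
| RF | RI1 | 0.832 | - | - | 0.811 | - | - | 0.839 | - | - | 0.694 | - | - |
| RF | RI2 | 0.577 | - | - | 0.246 | - | - | 0.357 | - | - | 0.179 | - | - |

Tab.4  Validation of clustering results for the four SCS sites with lithological units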
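
The three column labels repeated under each site in Table 4 (d with subscripts euc, man, and che) are read here as the Euclidean, Manhattan, and Chebyshev distances. That reading is an assumption based on the abbreviations and the reference list, since the page itself does not define them. A minimal Python sketch of the three metrics, with hypothetical function names:

```python
from math import sqrt


def d_euc(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def d_man(x, y):
    """Manhattan (taxicab) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def d_che(x, y):
    """Chebyshev distance: largest absolute difference on any axis."""
    return max(abs(a - b) for a, b in zip(x, y))


# The same pair of points gives three different distances.
p, q = (0, 0), (3, 4)
print(d_euc(p, q), d_man(p, q), d_che(p, q))  # 5.0 7 4
```

Because the metrics weight large single-variable differences differently, the same data can yield different clusters depending on which one is used, which is one plausible reason the scores in Table 4 vary across the three columns for a given site and method.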
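| Method | Index | 1146 d_euc | 1146 d_man | 1146 d_che | 1148 d_euc | 1148 d_man | 1148 d_che | U1431 d_euc | U1431 d_man | U1431 d_che | U1433 d_euc | U1433 d_man | U1433 d_che |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K-means | RI1 | 0.835 | 0.840 | 0.829 | 0.835 | 0.813 | 0.831 | 0.616 | 0.707 | 0.580 | 0.593 | 0.586 | 0.599 |
| K-means | RI2 | 0.381 | 0.554 | 0.520 | 0.381 | 0.320 | 0.368 | 0.188 | 0.381 | 0.098 | 0.189 | 0.163 | 0.214 |
| H.C. | RI1 | 0.708 | 0.768 | 0.544 | 0.447 | 0.444 | 0.600 | 0.430 | 0.432 | 0.433 | 0.375 | 0.375 | 0.375 |
| H.C. | RI2 | 0.405 | 0.406 | 0.166 | 0.081 | 0.073 | 0.169 | 0.050 | 0.044 | 0.016 | 0.074 | 0.074 | 0.074 |
| SOMs | RI1 | 0.773 | 0.791 | 0.722 | 0.838 | 0.816 | 0.833 | 0.573 | 0.578 | 0.461 | 0.591 | 0.588 | 0.599 |
| SOMs | RI2 | 0.386 | 0.447 | 0.309 | 0.389 | 0.345 | 0.375 | 0.118 | 0.130 | 0.003 | 0.190 | 0.163 | 0.217 |
| RF | RI1 | 0.836 | - | - | 0.861 | - | - | 0.731 | - | - | 0.706 | - | - |
| RF | RI2 | 0.543 | - | - | 0.435 | - | - | 0.425 | - | - | 0.254 | - | - |

Tab.5  Validation of clustering results for the four SCS sites with geological time scales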
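| Dataset | Index | 1146 | 1148 | U1431 | U1433 |
|---|---|---|---|---|---|
| MAD | RI1 | 0.74 | 0.813 | 0.787 | 0.682 |
| MAD | RI2 | 0.363 | 0.28 | 0.199 | 0.269 |
| RSC | RI1 | 0.66 | 0.778 | 0.769 | 0.643 |
| RSC | RI2 | 0.243 | 0.145 | 0.12 | 0.103 |
| MSL | RI1 | 0.622 | 0.796 | 0.697 | 0.593 |
| MSL | RI2 | 0.117 | 0.233 | 0.091 | 0.174 |
| GRA | RI1 | 0.679 | 0.748 | 0.78 | 0.59 |
| GRA | RI2 | 0.231 | 0.102 | 0.154 | 0.049 |
| NGR | RI1 | 0.593 | 0.688 | 0.808 | 0.633 |
| NGR | RI2 | 0.122 | 0.001 | 0.28 | 0.125 |

Tab.6  Clustering results for individual datasets for lithological units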
| Dataset | Index | 1146 | 1148 | U1431 | U1433 |
|---|---|---|---|---|---|
| MAD | RI1 | 0.818 | 0.808 | 0.612 | 0.673 |
| MAD | RI2 | 0.53 | 0.295 | 0.197 | 0.313 |
| RSC | RI1 | 0.659 | 0.762 | 0.544 | 0.624 |
| RSC | RI2 | 0.222 | 0.123 | 0.032 | 0.108 |
| MSL | RI1 | 0.685 | 0.763 | 0.445 | 0.558 |
| MSL | RI2 | 0.233 | 0.163 | -0.086 | 0.188 |
| GRA | RI1 | 0.702 | 0.767 | 0.584 | 0.475 |
| GRA | RI2 | 0.253 | 0.156 | 0.135 | 0.027 |
| NGR | RI1 | 0.601 | 0.683 | 0.652 | 0.607 |
| NGR | RI2 | 0.122 | 0.007 | 0.258 | 0.114 |

Tab.7  Clustering results for individual datasets for geological time scales
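| Normalization | Index | 1146 | 1148 | U1431 | U1433 |
|---|---|---|---|---|---|
| Original | RI1 | 0.65 | 0.769 | 0.668 | 0.525 |
| Original | RI2 | 0.233 | 0.21 | 0.098 | 0.097 |
| $x' = \dfrac{x - \bar{x}}{\sigma_x}$ | RI1 | 0.829 | 0.808 | 0.826 | 0.697 |
| $x' = \dfrac{x - \bar{x}}{\sigma_x}$ | RI2 | 0.568 | 0.272 | 0.329 | 0.236 |
| $x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$ | RI1 | 0.828 | 0.846 | 0.82 | 0.622 |
| $x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$ | RI2 | 0.567 | 0.403 | 0.3 | 0.168 |
| $x' = \ln(x - \min(x) + 1)$ | RI1 | 0.817 | 0.859 | 0.815 | 0.636 |
| $x' = \ln(x - \min(x) + 1)$ | RI2 | 0.551 | 0.455 | 0.267 | 0.214 |

Tab.8  Clustering results for different normalization methods on lithologic units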
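
The three rescalings compared in Table 8 can be read as a z-score, a min-max scaling, and a shifted natural logarithm. The Python sketch below applies each to a single variable. The function names are hypothetical, and the use of the population standard deviation in the z-score is an assumption, since the page does not say which estimator was used.

```python
from math import log
from statistics import mean, pstdev


def z_score(xs):
    """Centre on the mean and divide by the standard deviation.

    Assumption: population standard deviation (statistics.pstdev).
    """
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def min_max(xs):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def log_shift(xs):
    """Shift so the minimum maps to 1, then take the natural logarithm."""
    lo = min(xs)
    return [log(x - lo + 1) for x in xs]


# Toy usage on one variable; a constant column would divide by zero
# in z_score and min_max and needs separate handling.
bulk_density = [2.0, 4.0, 6.0]
print(min_max(bulk_density))  # [0.0, 0.5, 1.0]
```

Each variable would be rescaled separately before clustering, so that variables with large numeric ranges do not dominate the distance calculation.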
Fig.2  Unsupervised clustering results compared with lithological units and geological time scales for the ODP sites 1146 and 1148. In calculating the RI, the number of clusters is set equal to the number of lithological units or the number of geological time scales. (a) 1146 K-means lith. units (cluster=4, RI1=0.832, RI2=0.584); (b) 1146 K-means geo. units (cluster=5, RI1=0.840, RI2=0.554); (c) 1148 SOMs lith. units (cluster=7, RI1=0.869, RI2=0.503); (d) 1148 RF geo. units (cluster=6, RI1=0.861, RI2=0.435).
Fig.3  Unsupervised clustering results compared with lithological units and geological time scales for the IODP sites U1431 and U1433. In calculating the RI, the number of clusters is set to equal the number of lithological units or number of geological time scales. (a) U1431 RF lith. units (cluster=9, RI1=0.839, RI2=0.357); (b) U1431 RF geo. units (cluster=5, RI1=0.731, RI2=0.425); (c) U1433 SOMs lith. units (cluster=5, RI1=0.697, RI2=0.282); (d) U1433 RF geo. units (cluster=4, RI1=0.706, RI2=0.254).
1 E W Augustijn, R Zurita-Milla (2013). Self-organizing maps as an approach to exploring spatiotemporal diffusion patterns. Int J Health Geogr, 12(1): 60
https://doi.org/10.1186/1476-072X-12-60
2 J Baarsch, M Celebi (2012). Investigation of internal validity measures for k-means clustering. In: Proceedings of the International Multi Conference of Engineers and Computer Scientists
3 E Bedini (2009). Mapping lithology of the Sarfartoq carbonatite complex, southern West Greenland, using HyMap imaging spectrometer data. Remote Sens Environ, 113(6): 1208–1219
https://doi.org/10.1016/j.rse.2009.02.007
4 E Bedini (2012). Mapping alteration minerals at Malmbjerg molybdenum deposit, central East Greenland, by Kohonen self-organizing maps and matched filter analysis of HyMap data. Int J Remote Sens, 33(4): 939–961
https://doi.org/10.1080/01431161.2010.542202
5 D Benaouda, G Wadge, R B Whitmarsh, R G Rothwell, C MacLeod (1999). Inferring the lithology of borehole rocks by applying neural network classifiers to downhole logs: an example from the ocean drilling program. Geophys J Int, 136(2): 477–491
https://doi.org/10.1046/j.1365-246X.1999.00746.x
6 F P Bierlein, S J Fraser, W Brown, T Lees (2008). Advanced methodologies for the analysis of databases of mineral deposits and major faults. Aust J Earth Sci, 55(1): 79–99
https://doi.org/10.1080/08120090701581406
6 L Breiman (1984). Classification and Regression Trees. New York: Chapman & Hall, 87–91
7 L Breiman (2001). Random forests. Mach Learn, 45(1): 5–32
https://doi.org/10.1023/A:1010933404324
8 C D Cantrell (2000). Modern Mathematical Methods for Physicists and Engineers. Cambridge University Press
9 S Chauhan, W Ruhaak, F Khan, F Enzmann, P Mielke, M Kersten, I Sass (2016). Processing of rock core microtomography images: using seven different machine learning algorithms. Comput Geosci, 86: 120–128
https://doi.org/10.1016/j.cageo.2015.10.013
10 M J Cracknell, A M Reading, A W McNeill (2014). Mapping geology and volcanic-hosted massive sulfide alteration in the Hellyer-Mt Charter region, Tasmania, using Random Forest and Self-Organising Maps. Aust J Earth Sci, 61(2): 287–304
https://doi.org/10.1080/08120099.2014.858081
11 J N Goetz, A Brenning, H Petschko, P Leopold (2015). Evaluating machine learning and statistical prediction techniques for landslide susceptibility modeling. Comput Geosci, 81: 1–11
12 M Halkidi, Y Batistakis, M Vazirgiannis (2002). Clustering validity checking methods: part II. ACM SIGMOD Rec, 31(3): 19–27
13 L Hamel (2009). Knowledge Discovery with Support Vector Machines. New York: John Wiley and Sons, 89–132
14 C Hennig (2015). What are the true clusters? Pattern Recognit Lett, 64: 53–62
https://doi.org/10.1016/j.patrec.2015.04.009
15 L Hubert, P Arabie (1985). Comparing partitions. J Classif, 2(1): 193–218
https://doi.org/10.1007/BF01908075
16 T L Insua, L Hamel, K Moran, L M Anderson, J M Webster (2015). Advanced classification of carbonate sediments based on physical properties. Sedimentology, 62(2): 590–606
https://doi.org/10.1111/sed.12168
17 J Jeong, E Park (2016). Comparative Application of Various Machine Learning Techniques for Lithology Predictions. J Soil Groundw Environ, 21(3): 21–34
https://doi.org/10.7857/JSGE.2016.21.3.021
18 R I Kabacoff (2015). R in Action: Data Analysis and Graphics with R. Greenwich, CT: Manning, 102–112
19 T Kohonen (1982). Self-organized formation of topologically correct feature maps. Biol Cybern, 43(1): 59–69
https://doi.org/10.1007/BF00337288
20 T Kohonen (2001). Self-Organizing Maps (3rd ed). New York: Springer, 132–154
21 E F Krause (1987). Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Stroud, UK: Dover, 120–132
22 D J Lary, A H Alavi, A H Gandomi, A L Walker (2016). Machine learning in geosciences and remote sensing. Geoscience Frontiers, 7(1): 3–10
https://doi.org/10.1016/j.gsf.2015.07.003
23 C F Li, J Lin, D K Kulhanek (2014). IODP expedition 349 preliminary report, South China Sea tectonics–Opening of the South China Sea and its implications for Southeast Asian tectonics, climates and deep mantle processes since the late Mesozoic. Technical report
25 G Longo, M Brescia, S Djorgovski, S Cavuoti, C Donalek (2014). Data driven discovery in astrophysics. Proceedings of ESA-ESRIN Conference: Big Data from Space 2014, Frascati, Italy
26 J MacQueen (1967). Some methods for classification and analysis of multivariate observations. In: Le Cam L M, Neyman J, eds. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California, 281–297
27 G A Marzo, T L Roush, A Blanco, S Fonti, V Orofino (2006). Cluster analysis of planetary remote sensing spectral data. Journal of Geophysical Research, 111: E03002
28 G Moore, A Taira, A Klaus, K Becker, M Saffer, E Screaton (2001). Proc. ODP, Init. Repts., 190. College Station, TX (Ocean Drilling Program)
29 K P Murphy (2012). Machine Learning: A Probabilistic Perspective. Cambridge: The MIT Press, 578–490
30 L Peeters, F Bação, V Lobo, A Dassargues (2007). Exploratory data analysis and clustering of multivariate spatial hydrogeological data by means of GEO3DSOM, a variant of Kohonen’s self-organizing map. Hydrol Earth Syst Sci, 11(4): 1309–1321
https://doi.org/10.5194/hess-11-1309-2007
31 B S Penn (2005). Using self-organizing maps to visualize high-dimensional data. Comput Geosci, 31(5): 531–544
https://doi.org/10.1016/j.cageo.2004.10.009
32 B T Pham, D T Bui, I Prakash (2017a). Landslide susceptibility assessment using bagging ensemble based alternating decision trees, logistic regression and J48 decision trees methods: a comparative study. Geotech Geol Eng, 35(6): 2597–2611
33 B T Pham, K Khosravi, I Prakash (2017b). Application and comparison of decision tree-based machine learning methods in landside susceptibility assessment at Pauri Garhwal area, Uttarakhand, India. Environmental Processes, 4(3): 711–730
34 B T Pham, D Tien Bui, H V Pham, H Q Le, I Prakash, M B Dholakia (2016). Landslide hazard assessment using random subspace fuzzy rules based classifier ensemble and probability analysis of rainfall data: a case study at Mu Cang Chai District, Yen Bai Province (Viet Nam). Journal of the Indian Society of Remote Sensing, 45: 673–683
35 W M Rand (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336): 846–850
36 B D Ripley (1996). Pattern Recognition and Neural Networks. Cambridge University Press, 248–290
37 T Romary, F Ors, J Rivoirard, J Deraisme (2015). Unsupervised classification of multivariate geostatistical data: two algorithms. Comput Geosci, 85: 96–103
https://doi.org/10.1016/j.cageo.2015.05.019
38 J L Schnase, T J Lee, C A Mattmann, C S Lynnes, L Cinquini, P M Ramirez, A F Hart, D N Williams, D Waliser, P Rinsland, W P Webster, D Q Duffy, M A McInerney, G S Tamkin, G L Potter, L Carriere (2016). Big data challenges in climate science. IEEE Geosciences and Remote Sensing, 4(3): 10–22
https://doi.org/10.1109/MGRS.2015.2514192
39 M Templ, P Filzmoser, C Reimann (2008). Cluster analysis applied to regional geochemical data: problems and possibilities. Appl Geochem, 23(8): 2198–2213
https://doi.org/10.1016/j.apgeochem.2008.03.004
40 P X Wang, Q Y Li (2009). The South China Sea Paleoceanography and Sedimentology. New York: Springer, 388–421
24 T Warren Liao (2005). Clustering of time series data - a survey. Pattern Recognit, 38(11): 1857–1874
https://doi.org/10.1016/j.patcog.2005.01.025
41 M J Way, J D Scargle, K M Ali, A N Srivastava (2012). Advances in Machine Learning and Data Mining for Astronomy. New York: CRC Press, 240–312
42 R Wehrens, L M C Buydens (2007). Self- and super-organising maps in R: the Kohonen package. Journal of Statistical Software, 21(5): 1–19
43 Y Xiong, R Zuo (2016). Recognition of geochemical anomalies using a deep autoencoder network. Comput Geosci, 86: 75–82
https://doi.org/10.1016/j.cageo.2015.10.006
44 X Yao, L G Tham, F C Dai (2008). Landslide susceptibility mapping based on Support Vector Machine: a case study on natural slopes of Hong Kong, China. Geomorphology, 101(4): 572–582
https://doi.org/10.1016/j.geomorph.2008.02.011
[1] Kevin C. Tse, Hon-Chim Chiu, Man-Yin Tsang, Yiliang Li, Edmund Y. Lam. An unsupervised learning approach to study synchroneity of past events in the South China Sea[J]. Front. Earth Sci., 2019, 13(3): 628-640.
[2] Feng MAO,Minhe JI,Ting LIU. Mining spatiotemporal patterns of urban dwellers from taxi trajectory data[J]. Front. Earth Sci., 2016, 10(2): 205-221.
[3] Nana SHI, Jinyan ZHAN, Feng WU, Jifu DU. Identification of the core ecosystem services and their spatial heterogeneity in Poyang Lake area[J]. Front Earth Sci Chin, 2009, 3(2): 214-220.