Frontiers of Earth Science

ISSN 2095-0195

ISSN 2095-0209(Online)

CN 11-5982/P

Postal Subscription Code 80-963

2018 Impact Factor: 1.205

Front. Earth Sci.    2019, Vol. 13 Issue (3) : 628-640    https://doi.org/10.1007/s11707-019-0748-x
RESEARCH ARTICLE
An unsupervised learning approach to study synchroneity of past events in the South China Sea
Kevin C. Tse1, Hon-Chim Chiu2, Man-Yin Tsang3, Yiliang Li1, Edmund Y. Lam4
1. Department of Earth Sciences, The University of Hong Kong, Pokfulam, Hong Kong, China
2. Department of Geography and Centre for Geo-computation Studies, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
3. Department of Earth Sciences, University of Toronto, Canada
4. Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam, Hong Kong, China
Abstract

Unsupervised machine learning methods were applied to multivariate geophysical and geochemical datasets of ocean floor sediment cores collected from the South China Sea. The well-preserved, continuous core samples contain high-resolution Cenozoic sediment records that enable detailed paleoenvironmental studies. Bayesian age-depth chronological models constructed from biostratigraphic control points for the drilling sites are applied to cluster boundaries generated by two popular unsupervised learning methods: K-means and random forest. Both methods produced compact and unambiguous clusters from the datasets, indicating that previously unknown data patterns can be revealed when all variables in the datasets are considered simultaneously. The synchroneity of past events represented by the cluster boundaries is studied across geographically separated ocean drilling sites by converting the fixed depths of the cluster boundaries into chronological ranges, represented by Gaussian density plots, which are then compared with known past events in the region. A Gaussian density peak at around 7.2 Ma is identified in the results from all three sites and is suggested to coincide with the initiation of the East Asian monsoon. In contrast to traditional statistical approaches, unsupervised learning requires no a priori assumptions, and the clustering results serve as a novel data-driven proxy for studying the complex and dynamic processes of the paleoenvironment surrounding the ocean sediment. This work is a pioneering approach to extracting valuable information about regional events and opens up a systematic and objective way to study the vast global ocean sediment datasets.

Keywords machine learning      ocean sediments      unsupervised classification     
Corresponding Author(s): Kevin C. Tse   
Just Accepted Date: 20 March 2019   Online First Date: 08 August 2019    Issue Date: 15 October 2019
 Cite this article:   
Kevin C. Tse,Hon-Chim Chiu,Man-Yin Tsang, et al. An unsupervised learning approach to study synchroneity of past events in the South China Sea[J]. Front. Earth Sci., 2019, 13(3): 628-640.
 URL:  
https://academic.hep.com.cn/fesci/EN/10.1007/s11707-019-0748-x
https://academic.hep.com.cn/fesci/EN/Y2019/V13/I3/628
Site (Hole) Water depth/m Latitude Longitude Penetration/mbsf Age at Base/Ma Reference
1143A 2772 9°21.72′N 113°17.11′E 394 16 Wang et al. (2000)
1146A 2092 19°27.401′N 116°16.363′E 607 19 Wang et al. (2000)
1148A 3292 18°50.167′N 116°33.932′E 694 32 Wang et al. (2000)
Tab.1  Summary of ODP Leg 184 sites selected in this study
Dataset Type Extracted Variables
Moisture and Density Geophysical water content (bulk), water content (dry), bulk density, dry density, grain density, porosity, void ratio
Gamma Ray Attenuation Geophysical density
Natural Gamma Radiation Geophysical bkg corrected counts
Magnetic Susceptibility Geophysical Drift-corrected suscept.
Reflectance Spectrophotometry and Colorimetry Geophysical L*, a*, b*
Smear Slide Geophysical sand, silt, clay
Carbonate Geochemical inorganic carbon, carbonate, total carbon, organic carbon, nitrogen, sulphur
Gas Chromatography Geochemical methane
Interstitial Water Geochemical ammonia, calcium, chloride, lithium, magnesium, phosphorus, phosphate
Tab.2  Summary of input datasets including 30 geophysical and geochemical variables from the ODP sites
Fig.1  Bayesian Age Chronology Plot of Site 1143. Red shaded area represents 95% highest posterior density regions (HDR) (Parnell et al., 2008), indicating the uncertainty of the ages between the dated depths of the control markers.
Fig.2  Bayesian Age Chronology Plot of Site 1146. Red shaded area represents 95% highest posterior density regions (HDR) (Parnell et al., 2008), indicating the uncertainty of the ages between the dated depths of the control markers.
Fig.3  Bayesian Age Chronology Plot of Site 1148. Red shaded area represents 95% highest posterior density regions (HDR) (Parnell et al., 2008), indicating the uncertainty of the ages between the dated depths of the control markers.
Fig.4  Total within-groups sums of squares against the number of clusters.
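A curve of total within-groups sum of squares (WSS) against the number of clusters is commonly read as an "elbow" plot: the cluster count is chosen where adding another cluster stops reducing WSS by much. The following is a minimal sketch of how such a curve can be computed, not the authors' code. It runs on synthetic two-dimensional points rather than the ODP variables, and the farthest-first initialisation, the blob centres and all function names are assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic seeding (an assumption): repeatedly pick the point
    farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns the final centroids."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def total_wss(points, centroids):
    """Total within-groups sum of squares for a set of centroids."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Synthetic stand-in data: three well-separated blobs (hypothetical values).
rng = random.Random(42)
centres = [(0.0, 0.0), (20.0, 0.0), (10.0, 18.0)]
points = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in centres
    for _ in range(40)
]

wss_by_k = {k: total_wss(points, kmeans(points, k)) for k in range(1, 7)}
for k, wss in wss_by_k.items():
    print(f"k={k}: total WSS = {wss:.1f}")
```

On data like this the WSS drops sharply up to the true number of blobs and flattens afterwards; with real multivariate core data the variables would normally be standardised first, and the elbow is usually less clear-cut.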
Fig.5  Unsupervised machine learning pipeline for ODP datasets.
Fig.6  Data clusters assigned by class numbers 1 to 6 by K-means (left) and RF (right) for ODP site 1143A. Each data point is only assigned one class and the class number is arbitrary for the two adopted unsupervised classification methods. (a) 1143A K-means; (b) 1143A RF.
Cluster boundary (mcd) Bayesian chronological range/Ma
36.4–40.4 0.3–1.6
105.1–109.1 1.5–2.1
189.9–193.9 3.1–5.0
299.0–303.0 6.5–7.7
371.7–375.8 7.0–8.2
Tab.3  K-means class boundaries and Bayesian chronological ranges for site 1143 (Li et al., 2004)
Cluster boundary (mcd) Bayesian chronological range/Ma
44.4–48.5 0.3–1.6
125.3–129.3 2.0–4.0
218.2–222.2 5.5–5.8
303.0–307.1 6.6–7.8
367.7–371.8 7.0–8.2
Tab.4  RF class boundaries and Bayesian chronological ranges for site 1143 (Li et al., 2004)
Fig.7  Data clusters assigned by class numbers 1 to 6 by K-means (left) and RF (right) for ODP site 1146A. Each data point is only assigned one class and the class number is arbitrary for the two adopted unsupervised classification methods. (a) 1146A K-means; (b) 1146A RF.
Cluster boundary (mcd) Bayesian chronological range/Ma
78.8–88.8 0.2–0.9
157.6–163.6 1.8–3.0
248.5–254.5 2.9–4.5
339.4–345.5 6.8–7.6
503.0–509.1 12.0–14.0
Tab.5  K-means class boundaries and Bayesian chronological ranges for site 1146 (Li et al., 2004)
Cluster boundary (mcd) Bayesian chronological range/Ma
163.6–169.7 1.5–2.6
260.6–266.7 3.0–4.9
369.7–375.8 7.2–7.8
454.5–460.6 10.9–11.8
515.2–521.2 13.0–14.5
Tab.6  RF class boundaries and Bayesian chronological ranges for site 1146 (Li et al., 2004)
Fig.8  Data clusters assigned by class numbers 1 to 6 by K-means (left) and RF (right) for ODP site 1148A. Each data point is only assigned one class and the class number is arbitrary for the two adopted unsupervised classification methods. (a) 1148A K-means; (b) 1148A RF.
Cluster boundary (mcd) Bayesian chronological range/Ma
11.7–14.6 0.1–1.6
55.7–58.6 0.2–1.7
93.7–96.7 0.3–1.9
164.0–167.0 2.7–4.0
216.8–219.7 6.5–7.5
Tab.7  K-means class boundaries and Bayesian chronological ranges for site 1148, max depth is limited to 290 to match 1146 and 1143 (Li and Li, 2004)
Cluster boundary (mcd) Bayesian chronological range/Ma
55.7–58.6 0.2–1.7
96.7–99.6 0.3–1.8
137.7–140.6 2.1–3.1
175.8–178.7 4.0–4.8
216.8–219.7 6.1–6.3
Tab.8  RF class boundaries and Bayesian chronological ranges for site 1148, max depth is limited to 290 to match 1146 and 1143 (Li and Li, 2004)
Fig.9  Mixed Gaussian density plots for the cluster boundaries produced from K-means (left) and RF (right) at site 1143. The blue line indicates the commencement of the East Asian monsoon at 7.2 Ma (An, 2000).
Fig.10  Mixed Gaussian density plots for the cluster boundaries produced from K-means (left) and RF (right) at site 1146. The blue line indicates the commencement of the East Asian monsoon at 7.2 Ma (An, 2000).
Fig.11  Mixed Gaussian density plots for the cluster boundaries produced from K-means (left) and RF (right) at site 1148. The blue line indicates the commencement of the East Asian monsoon at 7.2 Ma (An, 2000).
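One way to picture how a set of chronological ranges becomes a mixed Gaussian density is sketched below, using the K-means ranges for site 1143 from Tab. 3. This is an illustration under stated assumptions, not the authors' procedure: each range is treated as the central 95% interval of a Gaussian (mean at the midpoint, standard deviation equal to the half-width divided by 1.96), and the components are weighted equally. The function names are hypothetical.

```python
import math

# K-means boundary age ranges for site 1143 in Ma, copied from Tab. 3.
AGE_RANGES_MA = [(0.3, 1.6), (1.5, 2.1), (3.1, 5.0), (6.5, 7.7), (7.0, 8.2)]

# Assumption: each range is the central 95% interval of a Gaussian.
Z_95 = 1.96


def range_to_gaussian(low, high):
    """Convert an age range to (mean, standard deviation)."""
    mean = (low + high) / 2.0
    sd = (high - low) / (2.0 * Z_95)
    return mean, sd


def gaussian_pdf(x, mean, sd):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_density(x, components):
    """Equal-weight mixture of the Gaussian components (an assumption)."""
    return sum(gaussian_pdf(x, m, s) for m, s in components) / len(components)


components = [range_to_gaussian(lo, hi) for lo, hi in AGE_RANGES_MA]

# Evaluate the mixture on a 0 to 10 Ma grid and check that it integrates
# to roughly one (simple Riemann sum).
step = 0.01
grid = [i * step for i in range(1001)]
density = [mixture_density(x, components) for x in grid]
area = sum(density) * step

print(f"area under the curve: {area:.3f}")
print(f"density at 7.2 Ma: {mixture_density(7.2, components):.3f}")
print(f"density at 6.0 Ma: {mixture_density(6.0, components):.3f}")
```

Because the two oldest ranges in Tab. 3 overlap between 7.0 and 7.7 Ma, their components reinforce each other there, which is the kind of feature the density plots are meant to expose when sites are compared.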
Depth (mcd) Top/Bottom Thickness Age/Ma Datum Stratigraphic position Ref.
93.5/94.29 0.8 1.69 FO medium Gephyrocapsa spp. - Wang et al., 2000
190.8/200.6 9.8 4.99 LO C. acutus - Wang et al., 2000
216.6/219.6 3 5.54 FO Sphaeroidinella dehiscens N18 / N19 Nathan and Leckie, 2003
224.07/232.52 8.5 5.82 FO Globorotalia tumida N17b / N18 Nathan and Leckie, 2003
238.52/241.05 3.5 6.4 FO Pulleniatina primalis N17a / N17b Nathan and Leckie, 2003
453.06/456.06 3 8.58 FO Globorotalia plesiotumida N16 / N17a Nathan and Leckie, 2003
  Table A1 Astrochronologically tuned planktonic foraminifer biostratigraphic zonal boundaries, Site 1143 (~45 to ~93 ky resolution). FO and LO represent the bioevents of first occurrence and last occurrence of a species, respectively (Nathan and Leckie, 2003)
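To see roughly how control points such as those in Table A1 tie a depth to an age, the sketch below interpolates linearly between the datums. This is deliberately simpler than the Bayesian age-depth model of Parnell et al. (2008) used in the article: it returns a single point estimate with no uncertainty range. Taking the midpoint of each top/bottom depth pair is an assumption, and the function name is hypothetical.

```python
from bisect import bisect_right

# (top depth, bottom depth in mcd, age in Ma) for Site 1143, from Table A1.
CONTROL_POINTS = [
    (93.5, 94.29, 1.69),
    (190.8, 200.6, 4.99),
    (216.6, 219.6, 5.54),
    (224.07, 232.52, 5.82),
    (238.52, 241.05, 6.4),
    (453.06, 456.06, 8.58),
]

# Assumption: each datum sits at the midpoint of its depth interval.
DEPTHS = [(top + bottom) / 2.0 for top, bottom, _ in CONTROL_POINTS]
AGES = [age for _, _, age in CONTROL_POINTS]


def interpolate_age(depth_mcd):
    """Linearly interpolate an age (Ma) for a depth inside the dated interval."""
    if not DEPTHS[0] <= depth_mcd <= DEPTHS[-1]:
        raise ValueError("depth lies outside the dated interval")
    # Index of the control point just below (deeper than) the query depth.
    i = min(bisect_right(DEPTHS, depth_mcd), len(DEPTHS) - 1)
    d0, d1 = DEPTHS[i - 1], DEPTHS[i]
    a0, a1 = AGES[i - 1], AGES[i]
    return a0 + (depth_mcd - d0) / (d1 - d0) * (a1 - a0)


# Example: the K-means cluster boundary at 299.0-303.0 mcd from Tab. 3.
age_top = interpolate_age(299.0)
age_bottom = interpolate_age(303.0)
print(f"299.0 mcd -> {age_top:.2f} Ma, 303.0 mcd -> {age_bottom:.2f} Ma")
```

The interpolated ages for this boundary fall inside the 6.5–7.7 Ma range reported in Tab. 3, but the Bayesian model additionally propagates the dating uncertainty between control points, which plain interpolation cannot do.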
Depth (mcd) Top/Bottom Thickness Age/Ma Datum Stratigraphic position Ref.
83.9/93.4 9.5 0.46 LO P. lacunosa - Wang et al., 2000
226.7/237.1 10.4 2.83 LO D. tamalis - Wang et al., 2000
318.36/321.41 3.1 5.54 FO Sphaeroidinella dehiscens N18 / N19 Nathan and Leckie, 2003
321.41/324.36 3.0 5.82 FO Globorotalia tumida N17b / N18 Nathan and Leckie, 2003
337.56/338.52 1.0 6.4 FO Pulleniatina primalis N17a / N17b Nathan and Leckie, 2003
406.63/409.63 3 8.58 FO Globorotalia plesiotumida N16 / N17a Nathan and Leckie, 2003
432.38/433.88 1.5 9.82 FO Neogloboquadrina acostaensis N15 / N16 Nathan and Leckie, 2003
443.83/445.33 1.5 10.49 LO Paragloborotalia mayeri N14 / N15 Nathan and Leckie, 2003
471.92/473.42 1.5 11.19 FO Globoturborotalita nepenthes N13 / N14 Nathan and Leckie, 2003
504.1/505.6 1.5 13.42 FO Globorotalia fohsi s.l. N11 / N12 Nathan and Leckie, 2003
  Table A2 Astrochronologically tuned planktonic foraminifer biostratigraphic zonal boundaries, Site 1146. FO and LO represent the bioevents of first occurrence and last occurrence of a species, respectively (Nathan and Leckie, 2003)
Depth (mcd) Top/Bottom Thickness Age/Ma Datum Ref.
115.34/125.76 10.42 1.69 FO medium Gephyrocapsa spp. Wang et al., 2000
176.83/176.98 0.15 4.20 LO Globoturborotalita nepenthes Li et al., 2004
188.23/189.08 0.85 5.54 FO Sphaeroidinella dehiscens Li et al., 2004
195.98/196.18 0.2 5.82 FO Globorotalia tumida Li et al., 2004
206.71/206.88 0.17 6.2 FO Globigerinoides conglobatus Li et al., 2004
244.18/244.33 0.15 8.3 FO Globigerinoides extremus Li et al., 2004
257.23/257.08 0.15 9.5-9.8 LO Globoquadrina dehiscens Li et al., 2004
259.63/259.78 0.15 9.82 FO Neogloboquadrina acostaensis Li et al., 2004
275.21/275.23 0.02 10.49 LO Paragloborotalia mayeri Li et al., 2004
283.63/283.93 0.3 11.19 FO Globoturborotalita nepenthes Li et al., 2004
301.03/301.00 0.03 13.00 LO Globorotalia fohsi Li et al., 2004
303.43/303.13 0.3 13.42 FO Globorotalia fohsi Li et al., 2004
  Table A3 Age-depth model based on planktonic foraminifer datums, Site 1148. FO and LO represent the bioevents of first occurrence and last occurrence of a species, respectively (Li et al., 2004)
1 R B Alley, P A Mayewski, T Sowers, M Stuiver, K C Taylor, P U Clark (1997). Holocene climatic instability: a prominent, widespread event 8200 yr ago. Geology, 25(6): 483–507
2 Z An (2000). The history and variability of the East Asian paleomonsoon climate. Quat Sci Rev, 19(1): 171–187
3 D Benaouda, G Wadge, R B Whitmarsh, R G Rothwell, C MacLeod (1999). Inferring the lithology of borehole rocks by applying neural network classifiers to downhole logs: an example from the ocean drilling program. Geophys J Int, 136(2): 477–491
4 K D Bennett, J L Fuller (2002). Determining the age of the Mid-Holocene Tsuga canadensis (hemlock) decline, eastern North America. Holocene, 12(4): 421–429
5 H J B Birks (1989). Holocene isochrone maps and patterns of tree-spreading in the British isles. J Biogeogr, 16(6): 503–540
6 L Breiman (1984). Classification and Regression Trees. New York: Chapman & Hall
7 L Breiman (2001). Random forests. Mach Learn, 45: 5–32
8 S Chauhan, W Ruhaak, F Khan, F Enzmann, P Mielke, M Kersten, I Sass (2016). Processing of rock core microtomography images: using seven different machine learning algorithms. Comput Geosci, 86: 120–128
9 P Cheeseman, M Self, J Kelly, W Taylor, D Freeman, J Stutz (1988). Bayesian classification. In: Proceedings of the Seventh AAAI National Conference on Artificial Intelligence. AAAI’88. New York: AAAI Press, 607–611
10 M J Cracknell, A M Reading, A W McNeill (2014). Mapping geology and volcanic hosted massive sulfide alteration in the Hellyer-Mt Charter region, Tasmania, using random forest and self-organising maps. Aust J Earth Sci, 61: 287–304
11 M H A Davis (1984). Piecewise-deterministic markov processes: a general class of non-diffusion stochastic models (with discussion). J R Stat Soc B, 46: 353–388
12 Exp. 349 scientists. (2014). IODP expedition 349 preliminary report, South China Sea tectonics- opening of the South China Sea and its implications for southeast asian tectonics, climates and deep mantle processes since the late mesozoic. Initial reports. New York: IODP
13 J N Goetz, A Brenning, H Petschko, P Leopold (2015). Evaluating machine learning and statistical prediction techniques for landslide susceptibility modeling. Comput Geosci, 81: 1–11
14 J Haslett, A Parnell (2008). A simple monotone process with application to radiocarbon-dated depth chronologies. J R Stat Soc Ser C Appl Stat, 57(4): 399–418
https://doi.org/10.1111/j.1467-9876.2008.00623.x
15 R Hazen (2014). Data-driven abductive discovery in mineralogy. Am Mineral, 99: 2165–2170
16 C Hennig (2016). What are the true clusters? Pattern Recognit Lett, 64: 53–62
17 T L Insua, L Hamel, K Moran, L M Anderson, J M Webster (2015). Advanced classification of carbonate sediments based on physical properties. Sedimentology, 62: 590–606
18 R Isabella, J Backman, E Fornaciari (2006). A review of calcareous nannofossil astrobiochronology encompassing the past 25 million years. Quat Sci Rev, 25: 3113–3137
19 A K Jain (2010). Data clustering: 50 years beyond k-means. Pattern Recognit Lett, 31: 651–666
20 B Jorgensen (1987). Exponential dispersion models. J R Stat Soc B, 49: 127–162
21 R I Kabacoff (2015). R in Action- Data analysis and graphics with R. San Jose: Manning
22 T Kohonen (2001). Self-Organizing Maps. New York: Springer-Verlag
23 D J Lary, A H Alavi, A H Gandomi, L W Walker (2016). Machine learning in geosciences and remote sensing. Geoscience Frontiers, 7: 3–10
24 Q Li, Z Jian, B Li (2004). Oligocene-miocene planktonic foraminiferal biostratigraphy, site 1148, northern South China Sea. In: Proceedings of ODP Sci. Results. New York: IODP, 184 (1): 1–26
25 T W Liao (2005). Clustering of time series data—a survey. Pattern Recognit, 38: 1857–1874
26 Y Liu, R H Weisberg (2005). Patterns of ocean current variability on the west florida shelf using the self-organizing map. J Geophys Res Oceans, 110(C6): 0148–0227
27 Y Liu, R H Weisberg (2011). A review of self-organizing map applications in meteorology and oceanography. In: Mwasiagi J I, ed. Self-Organizing Maps—Applications and Novel Algorithm Design. Rijeka, Croatia: Intech, 253–272
28 J MacQueen (1967). Some methods for classification and analysis of multivariate observations. In: Le Cam L M, Neyman J, eds. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. San Francisco: University of California, 281–297
29 K P Murphy (2012). Machine Learning: A Probabilistic Perspective. New York: The MIT Press
30 T Nakamori (2001). Global carbonate accumulation rates from cretaceous to present and their implications for the carbon cycle model. Isl Arc, 10(1): 1–8
31 S Nathan, R Leckie (2003). Miocene planktonic foraminiferal biostratigraphy of sites 1143 and 1146, ODP leg 184, South China Sea. Proc ODP, Sci Results, 184 (1): 1–43
32 A Parnell, J Haslett, J Allen, C Buck, B Huntley (2008). A flexible approach to assessing synchroneity of past events using bayesian reconstructions of sedimentation history. Quat Sci Rev, 27(19): 1872–1885
33 E Pavlidou, M van der Meijde, H van der Werff, C Hecker (2016). Finding a needle by removing the haystack: a spatio-temporal normalization method for geophysical data. Comput Geosci, 90: 78–86
34 B S Penn (2005). Using self-organizing maps to visualize high-dimensional data. Comput Geosci, 31(5): 531–544
35 B T Pham, D T Bui, I Prakash (2017). Landslide susceptibility assessment using bagging ensemble based alternating decision trees, logistic regression and J48 decision trees methods: a comparative study. London. Geotech Geol Eng, 35(6): 2597–2611
36 B T Pham, D Tien Bui, H V Pham, H Q Le, I Prakash, M B Dholakia (2016). Landslide hazard assessment using random subspace fuzzy rules based classifier ensemble and probability analysis of rainfall data: a case study at Mu Cang Chai District, Yen Bai Province (Viet Nam). J In Soc of Remote Sensing, 45(4): 673–683
37 C L Philip Chen, C Y Zhang (2014). Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci, 275: 314–347
38 T Romary, J Rivoirard, J Deraisme (2015). Unsupervised classification of multivariate geostatistical data: two algorithms. Comput Geosci, 85: 96–103
39 J W Sammon (1969). A nonlinear mapping for data structure analysis. IEEE Trans Comput, 18: 401–409
40 A Singh, A Yadav, A Rana (2013). K-means with three different distance metrics. Int J Comput Appl, 67(10): 13–17
41 A Srivastava, R Nemani, K Steinhaeuser (2017). Large-Scale Machine Learning in the Earth Sciences. New York: Chapman and Hall/CRC
42 K C Tse, H C Chiu, M Y Tsang, Y Li, E Y Lam (2019). Unsupervised learning on scientific ocean drilling datasets from the South China Sea. Front Earth Sci, 13(1): 180–190
43 K L Wagstaff (2012). Proceedings of the 29th international conference on machine learning. San Francisco: California Institute of Technology
44 P Wang, P Blum, et al. (2000). 2000 Proceedings of the Ocean Drilling Program, Initial Reports, Vol. 184. Initial Reports. New York: ODP Press
45 P Wang, Q Li (2009). The South China Sea–paleoceanography and sedimentology. In: The South China Sea–Paleoceanography and Sedimentology. Berlin: Springer
46 M J Way, J D Scargle, K M Ali, A N Srivastava (2012). Advances in Machine Learning and Data Mining for Astronomy. New York: CRC Press
47 J M Whitman, T A Davies (1979). Cenozoic oceanic sedimentation rates: How good are the data? Mar Geol, 30(34): 269–284
48 R Williams (2011). Earth Science: New Methods and Studies. London: Apple Academic Press
49 P J Wolfe (2013). Making sense of big data. Proc Natl Acad Sci USA, 110(45): 18031–18032