Stratified sampling for data mining on the deep web
Tantan LIU, Fan WANG, Gagan AGRAWAL
Front Comput Sci. 2012, 6 (2): 179-196.
https://doi.org/10.1007/s11704-012-2859-3
In recent years, the deep web has become extremely popular. As with any other data source, data mining on the deep web can produce important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, and therefore, data mining must be performed by sampling the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems, association mining and differential rule mining. These are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source, and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs, compared with a stratified sampling method that only considers estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.
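To illustrate what it can mean for sample allocation to weigh estimation error against sampling cost, here is a minimal sketch of the classical textbook optimum allocation under a linear cost model, where the sample size of each stratum is proportional to N_h * S_h / sqrt(c_h). This is not the allocation method of the paper, which the abstract does not specify. The function name `cost_aware_allocation` and its parameters are hypothetical, and the stratum sizes, standard deviations, and per-sample costs are assumed to be known in advance.

```python
import math


def cost_aware_allocation(sizes, std_devs, costs, budget):
    """Classical optimum allocation for stratified sampling (illustrative only).

    sizes    -- stratum population sizes N_h (assumed known)
    std_devs -- within-stratum standard deviations S_h (assumed known)
    costs    -- cost c_h of drawing one sample from stratum h (must be > 0)
    budget   -- total sampling budget, so that sum(n_h * c_h) == budget

    Returns fractional sample sizes n_h proportional to N_h * S_h / sqrt(c_h).
    """
    if not (len(sizes) == len(std_devs) == len(costs)):
        raise ValueError("sizes, std_devs and costs must have the same length")
    if any(c <= 0 for c in costs):
        raise ValueError("every per-sample cost must be positive")

    # Unscaled weights: large, variable, cheap strata get more samples.
    weights = [n * s / math.sqrt(c) for n, s, c in zip(sizes, std_devs, costs)]

    # Cost of one "unit" of the unscaled allocation.
    unit_cost = sum(w * c for w, c in zip(weights, costs))
    if unit_cost == 0:
        raise ValueError("all strata have zero size or zero variance")

    scale = budget / unit_cost
    return [w * scale for w in weights]


# Two equally sized, equally variable strata; the second costs 4x per sample.
allocation = cost_aware_allocation([100, 100], [1.0, 1.0], [1.0, 4.0], 30.0)
```

In this example the cheaper stratum receives twice as many samples as the expensive one, and the total cost of the allocation equals the budget. In practice the results would be rounded to integers and the standard deviations estimated from a pilot sample.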
Managing advertising campaigns—an approximate planning approach
Sertan GIRGIN, Jérémie MARY, Philippe PREUX, Olivier NICOL
Front Comput Sci. 2012, 6 (2): 209-229.
https://doi.org/10.1007/s11704-012-2873-5
We consider the problem of displaying commercial advertisements on web pages, in the “cost per click” model. The advertisement server has to learn the appeal of each type of visitor for the different advertisements in order to maximize the profit. Advertisements have constraints such as a certain number of clicks to draw, as well as a lifetime. This problem is thus inherently dynamic, and intimately combines combinatorial and statistical issues. To set the stage, it is also noteworthy that we deal with very rare events of interest, since the base probability of one click is on the order of 10^-4. Different approaches may be thought of, ranging from computationally demanding ones (use of Markov decision processes, or stochastic programming) to very fast ones. We introduce NOSEED, an adaptive policy learning algorithm based on a combination of linear programming and multi-armed bandits. We also propose a way to evaluate the extent to which we have to handle the constraints (which is directly related to the computation cost). We investigate the performance of our system through simulations on a realistic model designed with an important commercial web actor.
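For readers unfamiliar with the bandit component mentioned above, the following is a minimal sketch of the generic UCB1 selection rule applied to click counts. It is not NOSEED: it has no linear programming step and ignores the click-quota and lifetime constraints described in the abstract. The function name `ucb1_choose` and its arguments are hypothetical.

```python
import math


def ucb1_choose(clicks, shows):
    """Pick the index of the advertisement to display next using UCB1.

    clicks -- number of clicks observed so far for each advertisement
    shows  -- number of times each advertisement has been displayed
    """
    # Display every advertisement once before comparing estimates.
    for index, count in enumerate(shows):
        if count == 0:
            return index

    total = sum(shows)
    # Observed click rate plus an exploration bonus that shrinks as an
    # advertisement is displayed more often.
    scores = [
        c / n + math.sqrt(2.0 * math.log(total) / n)
        for c, n in zip(clicks, shows)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# A rarely shown advertisement wins on its exploration bonus alone.
next_ad = ucb1_choose([50, 0], [100, 1])
```

With click probabilities on the order of 10^-4, the exploration bonus dominates the observed rates for a very long time, which is one reason plain bandit rules are slow in this setting.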
An approach for automatic sleep stage scoring and apnea-hypopnea detection
Tim SCHLÜTER, Stefan CONRAD
Front Comput Sci. 2012, 6 (2): 230-241.
https://doi.org/10.1007/s11704-012-2872-6
In this article we present an application of data mining to the medical domain of sleep research: an approach for automatic sleep stage scoring and apnea-hypopnea detection. By combining several techniques (Fourier and wavelet transform, derivative dynamic time warping, and waveform recognition), our approach extracts meaningful features (frequencies and special patterns like k-complexes and sleep spindles) from physiological recordings containing EEG, ECG, EOG and EMG data. Based on these pieces of information, an ensemble of decision trees is constructed using the principle of bagging, which classifies sleep epochs into their sleep stages according to the rules by Rechtschaffen and Kales and annotates occurrences of apnea-hypopnea (total or partial cessation of respiration). After that, case-based reasoning is applied in order to improve quality. We tested and evaluated our approach on several large public databases from PhysioBank, which showed an overall accuracy of 95.2% for sleep stage scoring and 94.5% for classifying minutes as apneic or non-apneic.