Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

   Online First

Administered by

, Volume 6 Issue 2

For Selected: View Abstracts Toggle Thumbnails
RESEARCH ARTICLE
Multi-stream join answering for mining significant cross-stream correlations
Robert GWADERA
Front Comput Sci. 2012, 6 (2): 131-142.  
https://doi.org/10.1007/s11704-012-2862-8

Abstract   HTML   PDF (684KB)

Sliding-window multi-stream join (SWMJ) is a fundamental operation for correlating information from different streams. We provide a solution to the problem of assessing significance of the SWMJ result by focusing on the relative frequency of windows satisfying a given equijoin predicate as the most important parameter of the SWMJ result. In particular, we derive a formula for computing the expected relative frequency of windows satisfying a given equijoin predicate that can be evaluated in quadratic time in the window size given a proposed probabilistic model of the multi-stream. In experiments conducted on a daily rainfall data set we demonstrate the remarkable accuracy of our method, which confirms our theoretical analysis.

References | Related Articles | Metrics
Active improvement of hierarchical object features under budget constraints
Nicolas CEBRON
Front Comput Sci. 2012, 6 (2): 143-153.  
https://doi.org/10.1007/s11704-012-2857-5

Abstract   HTML   PDF (657KB)

When we think of an object in a supervised learning setting, we usually perceive it as a collection of fixed attribute values. Although this setting may be suited well for many classification tasks, we propose a new object representation and therewith a new challenge in data mining; an object is no longer described by one set of attributes but is represented in a hierarchy of attribute sets in different levels of quality. Obtaining a more detailed representation of an object comes with a cost. This raises the interesting question of which objects we want to enhance under a given budget and cost model. This new setting is very useful whenever resources like computing power, memory or time are limited. We propose a new active adaptive algorithm (AAA) to improve objects in an iterative fashion. We demonstrate how to create a hierarchical object representation and prove the effectiveness of our new selection algorithm on these datasets.

References | Related Articles | Metrics
An effective framework for characterizing rare categories
Jingrui HE, Hanghang TONG, Jaime CARBONELL
Front Comput Sci. 2012, 6 (2): 154-165.  
https://doi.org/11.1007/s11704-012-2861-9

Abstract   HTML   PDF (830KB)

Rare categories become more and more abundant and their characterization has received little attention thus far. Fraudulent banking transactions, network intrusions, and rare diseases are examples of rare classes whose detection and characterization are of high value. However, accurate characterization is challenging due to high-skewness and nonseparability from majority classes, e.g., fraudulent transactions masquerade as legitimate ones. This paper proposes the RACH algorithm by exploring the compactness property of the rare categories. This algorithm is semi-supervised in nature since it uses both labeled and unlabeled data. It is based on an optimization framework which encloses the rare examples by a minimum-radius hyperball. The framework is then converted into a convex optimization problem, which is in turn effectively solved in its dual form by the projected subgradient method. RACH can be naturally kernelized. Experimental results validate the effectiveness of RACH.

References | Related Articles | Metrics
Resilient k-d trees: k-means in space revisited
Fabian GIESEKE, Gabriel MORUZ, Jan VAHRENHOLD
Front Comput Sci. 2012, 6 (2): 166-178.  
https://doi.org/10.1007/s11704-012-2870-8

Abstract   HTML   PDF (618KB)

We propose a k-d tree variant that is resilient to a pre-described number of memory corruptions while still using only linear space. While the data structure is of independent interest, we demonstrate its use in the context of highradiation environments. Our experimental evaluation demonstrates that the resulting approach leads to a significantly higher resiliency rate compared to previous results. This is especially the case for large-scale multi-spectral satellite data, which renders the proposed approach well-suited to operate aboard today’s satellites.

References | Related Articles | Metrics
Stratified sampling for data mining on the deep web
Tantan LIU, Fan WANG, Gagan AGRAWAL
Front Comput Sci. 2012, 6 (2): 179-196.  
https://doi.org/10.1007/s11704-012-2859-3

Abstract   HTML   PDF (650KB)

In recent years, the deep web has become extremely popular. Like any other data source, data mining on the deep web can produce important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, and therefore, data mining must be performed by sampling the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems, association mining and differential rulemining. These are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source, and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs, compared with a stratified sampling method that only considers estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.

References | Related Articles | Metrics
Two of a kind or the ratings game? Adaptive pairwise preferences and latent factor models
Suhrid BALAKRISHNAN, Sumit CHOPRA
Front Comput Sci. 2012, 6 (2): 197-208.  
https://doi.org/10.1007/s11704-012-2871-7

Abstract   HTML   PDF (388KB)

Latent factor models have become a workhorse for a large number of recommender systems.While these systems are built using ratings data, which is typically assumed static, the ability to incorporate different kinds of subsequent user feedback is an important asset. For instance, the user might want to provide additional information to the system in order to improve his personal recommendations. To this end, we examine a novel scheme for efficiently learning (or refining) user parameters from such feedback. We propose a scheme where users are presented with a sequence of pairwise preference questions: “Do you prefer item A over B?” User parameters are updated based on their response, and subsequent questions are chosen adaptively after incorporating the feedback. We operate in a Bayesian framework and the choice of questions is based on an information gain criterion. We validate the scheme on the Netflix movie ratings data set and a proprietary television viewership data set. A user study and automated experiments validate our findings.

References | Related Articles | Metrics
Managing advertising campaigns—an approximate planning approach
Sertan GIRGIN, Jérémie MARY, Philippe PREUX, Olivier NICOL
Front Comput Sci. 2012, 6 (2): 209-229.  
https://doi.org/10.1007/s11704-012-2873-5

Abstract   HTML   PDF (999KB)

We consider the problem of displaying commercial advertisements on web pages, in the “cost per click” model. The advertisement server has to learn the appeal of each type of visitor for the different advertisements in order to maximize the profit. Advertisements have constraints such as a certain number of clicks to draw, as well as a lifetime. This problem is thus inherently dynamic, and intimately combines combinatorial and statistical issues. To set the stage, it is also noteworthy that we deal with very rare events of interest, since the base probability of one click is in the order of 10-4. Different approaches may be thought of, ranging from computationally demanding ones (use of Markov decision processes, or stochastic programming) to very fast ones.We introduce NOSEED, an adaptive policy learning algorithm based on a combination of linear programming and multi-arm bandits. We also propose a way to evaluate the extent to which we have to handle the constraints (which is directly related to the computation cost). We investigate the performance of our system through simulations on a realistic model designed with an important commercial web actor.

References | Related Articles | Metrics
An approach for automatic sleep stage scoring and apnea-hypopnea detection
Tim SCHLüTER, Stefan CONRAD
Front Comput Sci. 2012, 6 (2): 230-241.  
https://doi.org/10.1007/s11704-012-2872-6

Abstract   HTML   PDF (407KB)

In this article we present an application of data mining to the medical domain sleep research, an approach for automatic sleep stage scoring and apnea-hypopnea detection. By several combined techniques (Fourier and wavelet transform, derivative dynamic time warping, and waveform recognition), our approach extracts meaningful features (frequencies and special patterns like k-complexes and sleep spindles) from physiological recordings containing EEG, ECG, EOG and EMG data. Based on these pieces of information, an ensemble of decision trees is constructed using the principle of bagging, which classifies sleep epochs in their sleep stages according to the rules by Rechtschaffen and Kales and annotates occurrences of apnea-hypopnea (total or partial cessation of respiration). After that, casebased reasoning is applied in order to improve quality. We tested and evaluated our approach on several large public databases from PhysioBank, which showed an overall accuracy of 95.2% for sleep stage scoring and 94.5% for classifying minutes as apneic or non-apneic.

References | Related Articles | Metrics
8 articles