Accepted Papers

IDA is pleased to announce this year’s accepted papers. We received a total of 75 submissions, of which 15 were accepted as regular papers for oral presentation, 17 were accepted as regular papers for poster presentation, and 7 were accepted to the Industrial Challenge track. In addition, we have accepted two Horizon Track abstracts.


Regular Papers


  • Syed Murtaza Hassan, Luis Moreira-Matias, Jihed Khiari and Oded Cats. Feature Selection Issues in Long-Term Travel Time Prediction 

Abstract: Long-term travel time predictions are crucial for tactical and operational public transport planning in schedule design and resource allocation tasks. As with any regression task, their success depends considerably on an adequate feature selection framework. In this paper, we address the myopia of the state-of-the-art method RReliefF in mining relevant inter-relationships of the feature space for reducing the entropy around the target variable. A comparative study was conducted using baseline regression methods and LASSO as a valid alternative to RReliefF. Experimental results obtained on a real-world case study running in Sweden uncovered the bias/variance reduction obtained by each approach, pointing out promising ideas for this research line.
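
For readers unfamiliar with the LASSO alternative discussed above, the sketch below shows the general shape of LASSO-based feature selection with scikit-learn. It is purely illustrative: the synthetic regression data stands in for the Swedish travel-time case study, and none of it is the authors' code.

```python
# Hypothetical sketch of LASSO-based feature selection for a regression
# task (synthetic data; not the paper's pipeline or dataset).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=30, n_informative=8,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)   # LASSO is sensitive to feature scale

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # features with non-zero coefficients
print(f"LASSO kept {selected.size} of {X.shape[1]} features:", selected)
```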

  • Petr Rysavy and Filip Zelezny. Estimating Sequence Similarity from Read Sets for Clustering Sequencing Data

Abstract: Clustering biological sequences is a central task in bioinformatics. The typical result of new-generation sequencers is a set of short substrings (“reads”) of a target sequence, rather than the sequence itself. To cluster sequences given only their read-set representations, one may try to reconstruct each one from the corresponding read set, and then employ conventional (dis)similarity measures such as the edit distance on the assembled sequences. This approach is, however, problematic, and we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation of the Monge-Elkan similarity known from the field of databases. It avoids the NP-hard problem of sequence assembly, and in empirical experiments it results in a better approximation of the true sequence similarities and consequently in better clustering, in comparison to the first-assemble-then-cluster approach.
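
The Monge-Elkan scheme mentioned above has a simple closed form: average, over the elements of one set, the best similarity achieved in the other set. The snippet below is an illustrative re-implementation over read sets with a normalized edit-distance similarity; it is not the authors' adapted measure.

```python
# Plain Monge-Elkan similarity between two read sets (illustrative only).
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming (rolling row)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def read_similarity(r1, r2):
    """Edit distance rescaled to a [0, 1] similarity."""
    return 1.0 - edit_distance(r1, r2) / max(len(r1), len(r2))

def monge_elkan(reads_a, reads_b):
    """Average, over reads in A, of the best-matching read in B."""
    return sum(max(read_similarity(a, b) for b in reads_b)
               for a in reads_a) / len(reads_a)

print(monge_elkan(["ACGTTG", "TTGACC"], ["ACGTTC", "TTGACG"]))
```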

  • Gabriella Contardo, Ludovic Denoyer and Thierry Artieres. Sequential Cost-Sensitive Feature Acquisition

Abstract: We propose a reinforcement learning based approach to tackle the cost-sensitive learning problem where each input feature has a specific cost. The acquisition process is handled through a stochastic policy which allows features to be acquired in an adaptive way. The general architecture of our approach relies on representation learning to enable prediction on any partially observed sample, whatever the set of observed features. The resulting model is an original mix of representation learning and reinforcement learning ideas. It is learned with policy gradient techniques to minimize a budgeted inference cost. We demonstrate the effectiveness of our proposed method with several experiments on a variety of datasets, both for the sparse prediction problem where all features have the same cost and for cost-sensitive settings.

  • Jesper van Engelen, Hanjo Boekhout and Frank Takes. Explainable and Efficient Link Prediction in Real-World Network Data

Abstract: Data that involves some sort of relationship or interaction can be represented, modelled and analyzed using the notion of a network. To understand the dynamics of networks, the link prediction problem is concerned with predicting the evolution of the topology of a network over time. Previous work in this direction has largely focussed on finding an extensive set of features capable of predicting the formation of a link, often within some domain-specific context. This sometimes results in a “black box” type of approach in which it is unclear how the (often computationally expensive) features contribute to the accuracy of the final predictor. This paper counters these problems by categorising the large set of proposed link prediction features based on their topological scope, and showing that the contribution of particular categories of features can actually be explained by simple structural properties of the network. A novel approach called the Efficient Feature Set is presented that uses a limited but explainable set of computationally efficient features that within each scope captures the essential network properties. The performance of the proposed approach is experimentally verified using a large number of diverse real-world network datasets. The result is a generic approach suitable for consistently predicting links with high accuracy across a variety of real-world networks.

  • Kata Gábor, Haïfa Zargayouna, Isabelle Tellier, Davide Buscaldi and Thierry Charnois. Unsupervised Relation Extraction in Specialized Corpora using Sequence Mining

Abstract: This paper deals with the extraction of semantic relations from scientific texts. Pattern-based representations are compared to word embeddings in unsupervised clustering experiments, according to their potential to discover new types of semantic relations and recognize their instances. The results indicate that sequential pattern mining can significantly improve pattern-based representations, even in a completely unsupervised setting.

  • Tobias Sobek and Frank Höppner. Visual Perception of Discriminative Landmarks in Classified Time Series

Abstract: Distance measures play a central role for time series data. Such measures condense two complex structures into a convenient, single number, at the cost of losing many details. This might become a problem when the series are in general quite similar to each other and series from different classes differ only in details. This work aims at supporting an analyst in the explorative data understanding phase, where she wants to get an impression of how time series from different classes compare. Based on the interval tree of scales, we develop a visualisation that draws the attention of the analyst immediately to those details of a time series that are representative or discriminative for the class. The visualisation adapts to the human perception of a time series by addressing the persistence and distinctiveness of landmarks in the series.

  • Nasser Davarzani, Ralf Peeters, Evgueni Smirnov, Joel Karel and Hans-Peter Brunner-La Rocca. Ranking Accuracy for Logistic-GEE models

Abstract: Logistic Generalized Estimating Equations (logistic-GEE) models have been extensively used for analyzing clustered binary data. However, assessing the goodness-of-fit and predictability of logistic-GEE models is problematic, since no likelihood is available and the observations are correlated within a cluster. In this paper we propose a new measure for estimating the generalization performance of logistic-GEE models, namely ranking accuracy for models based on clustered data (RAMCD). We define RAMCD as the probability that a randomly selected positive observation is ranked higher than a randomly selected negative observation from another cluster. We propose a computationally efficient algorithm for computing RAMCD. The algorithm can be applied in two settings: (1) estimating RAMCD as a goodness-of-fit criterion and (2) estimating RAMCD as a predictability criterion. This is experimentally shown on clustered data from a simulation study and a pharmaceutical study.
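
The RAMCD definition above translates directly into a naive computation: count, over all positive/negative pairs drawn from different clusters, how often the positive is scored higher. The paper proposes a more efficient algorithm; the brute-force version below only makes the definition concrete (ties counted as 0.5 is an assumption, following the usual AUC convention).

```python
# Naive O(n^2) illustration of RAMCD as defined in the abstract.
import numpy as np

def ramcd(scores, labels, clusters):
    pos = [(s, c) for s, y, c in zip(scores, labels, clusters) if y == 1]
    neg = [(s, c) for s, y, c in zip(scores, labels, clusters) if y == 0]
    num = den = 0.0
    for sp, cp in pos:
        for sn, cn in neg:
            if cp == cn:            # only cross-cluster pairs count
                continue
            den += 1
            num += 1.0 if sp > sn else (0.5 if sp == sn else 0.0)
    return num / den if den else float("nan")

rng = np.random.default_rng(0)
print(ramcd(rng.random(100), rng.integers(0, 2, 100), rng.integers(0, 5, 100)))
```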

  • David Weston. A Framework for Interpolating Scattered Data using Space-filling Curves

Abstract: The analysis of spatial data occurs in many disciplines and covers a wide variety of activities. Available techniques for such analysis include spatial interpolation, which is useful for tasks such as visualization and imputation. This paper proposes a novel approach to interpolation using space-filling curves. Two simple interpolation methods are described and their ability to interpolate is compared to that of several interpolation techniques, including Natural Neighbour interpolation. The proposed approach includes a Monte-Carlo step that requires a large number of iterations. However, experiments demonstrate that the number of iterations will not change appreciably with larger datasets.
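
To make the idea concrete, here is a toy version of interpolation via a space-filling curve: quantize 2D sample locations onto a grid, interleave coordinate bits into a Z-order (Morton) key, sort points along the curve, and interpolate between the query's curve neighbours. This is only an illustration of the underlying mechanism; the paper's framework, including its Monte-Carlo averaging step, is not reproduced here.

```python
# Toy space-filling-curve interpolation with a Z-order key (illustrative).
import numpy as np

def morton_key(ix, iy, bits=16):
    """Interleave the bits of integer grid coordinates ix and iy."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b) | ((iy >> b) & 1) << (2 * b + 1)
    return key

def sfc_interpolate(points, values, query, bits=16):
    scale = (1 << bits) - 1                      # points assumed in [0, 1]^2
    grid = np.floor(points * scale).astype(int)
    keys = np.array([morton_key(x, y, bits) for x, y in grid])
    order = np.argsort(keys)
    keys, vals = keys[order], np.asarray(values)[order]
    qx, qy = np.floor(query * scale).astype(int)
    qk = morton_key(qx, qy, bits)
    i = np.searchsorted(keys, qk)                # neighbours along the curve
    if i == 0:
        return vals[0]
    if i == len(keys):
        return vals[-1]
    w = (qk - keys[i - 1]) / (keys[i] - keys[i - 1])
    return (1 - w) * vals[i - 1] + w * vals[i]

rng = np.random.default_rng(1)
pts = rng.random((200, 2))
vals = np.sin(4 * pts[:, 0]) + np.cos(4 * pts[:, 1])
print(sfc_interpolate(pts, vals, np.array([0.5, 0.5])))
```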

  • Andreas Weiler, Joeran Beel, Bela Gipp, and Michael Grossniklaus. Stability Evaluation of Event Detection Techniques for Twitter

Abstract: Twitter continues to gain popularity as a source of up-to-date news and information. As a result, numerous event detection techniques have been proposed to cope with the steadily increasing rate and volume of social media data streams. Although most of these works conduct some evaluation of the proposed technique, comparing their effectiveness is a challenging task. In this paper, we examine the challenges to reproducing evaluation results for event detection techniques. We apply several event detection techniques and vary four parameters, namely time window (15 vs. 30 vs. 60 mins), stopwords (include vs. exclude), retweets (include vs. exclude), and the number of terms that define an event (1…5 terms). Our experiments use real-world Twitter streaming data and show that varying these parameters alone significantly influences the outcomes of the event detection techniques, sometimes in unforeseen ways. We conclude that even minor variations in event detection techniques may lead to major difficulties in reproducing experiments.

  • Joshua Garland, Tyler R. Jones, Elizabeth Bradley, Ryan G. James and James W. C. White. A First Step Toward Quantifying the Climate’s Information Production Over the Last 68,000 Years

Abstract: Paleoclimate records are extremely rich sources of information about the past history of the Earth system. We take an information-theoretic approach to analyzing data from the WAIS Divide ice core, the longest continuous and highest-resolution water isotope record yet recovered from Antarctica. We use weighted permutation entropy to calculate the Shannon entropy rate from these isotope measurements, which are proxies for a number of different climate variables, including the temperature at the time of deposition of the corresponding layer of the core. We find that the rate of information production in these measurements reveals issues with analysis instruments, even when those issues leave no visible traces in the raw data. These entropy calculations also allow us to identify a number of intervals in the data that may be of direct relevance to paleoclimate interpretation, and to form new conjectures about what is happening in those intervals, including periods of abrupt climate change.
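
Weighted permutation entropy, the statistic used above, is straightforward to compute: slide a window over the series, record each window's ordinal pattern, weight the pattern count by the window's variance, and take the normalized Shannon entropy of the resulting distribution. The sketch below is an illustrative implementation, not the authors' analysis pipeline.

```python
# Minimal weighted permutation entropy (WPE), normalized to [0, 1].
import numpy as np
from collections import defaultdict
from math import factorial

def weighted_permutation_entropy(x, m=3, tau=1):
    x = np.asarray(x, dtype=float)
    weights, total = defaultdict(float), 0.0
    for i in range(len(x) - (m - 1) * tau):
        window = x[i:i + m * tau:tau]           # delay-embedded window
        pattern = tuple(np.argsort(window))     # its ordinal pattern
        w = window.var()                        # weight by window variance
        weights[pattern] += w
        total += w
    p = np.array(list(weights.values())) / total
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum() / np.log2(factorial(m)))

rng = np.random.default_rng(0)
print(weighted_permutation_entropy(rng.standard_normal(10_000)))
```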

  • James Brofos, Rui Shu and Frank Zhang. The Optimistic Method for Model Estimation

Abstract: We present the method of optimistic estimation, a novel paradigm that seeks to incorporate robustness to errors-in-variables biases directly into the estimation objective function. This approach protects parameter estimates in statistical models from data set corruption. We apply the optimistic paradigm to estimation of linear regression, logistic regression, and Ising graphical models in the presence of noise and demonstrate that more accurate predictions of the model parameters can be obtained.

  • Martijn Post, Peter van der Putten and Jan N. van Rijn. Does Feature Selection Improve Classification? A Large Scale Experiment in OpenML

Abstract: It is often claimed that data pre-processing is an important factor contributing towards the performance of classification algorithms. In this paper we investigate feature selection, a common data pre-processing technique. We conduct a large scale experiment and present results on which types of algorithms and datasets typically benefit from this technique. Using meta-learning, we can find out for which combinations this is the case. In addition to a vast set of meta-features, we introduce the Feature Selection Landmarkers, which prove useful for this task. All our experimental results are made publicly available on OpenML.

  • Julien Ah-Pine and Xinyu Wang. Similarity based hierarchical clustering with an application to text collections

Abstract: The Lance-Williams formula is a framework that unifies seven schemes of agglomerative hierarchical clustering. In this paper, we establish a new expression of this formula using inner-product-based similarities instead of distances. We state sufficient conditions under which the new formula is equivalent to the original one. The interest of our approach is twofold. First, we can naturally extend agglomerative hierarchical clustering techniques to kernel functions. Second, reasoning in terms of inner products allows us to design thresholding strategies on proximity values. Thereby, we propose to sparsify the similarity matrix with the goal of making these clustering techniques more efficient. We apply our approach to text clustering tasks. Our results show that sparsifying the inner-product matrix considerably decreases memory usage and shortens running time while preserving clustering quality.
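
The Lance-Williams recurrence referenced above updates the distance between an existing cluster k and a newly merged cluster i∪j from already-known pairwise distances. The snippet below shows the classical distance form with coefficients for three of the seven schemes; the paper's inner-product reformulation is not reproduced here.

```python
# Classical Lance-Williams update (illustrative):
# d(k, i+j) = a_i*d(k,i) + a_j*d(k,j) + b*d(i,j) + g*|d(k,i) - d(k,j)|
def lance_williams(d_ki, d_kj, d_ij, a_i, a_j, b, g):
    return a_i * d_ki + a_j * d_kj + b * d_ij + g * abs(d_ki - d_kj)

def coefficients(scheme, n_i, n_j):
    """Coefficients for three of the seven unified schemes."""
    if scheme == "single":
        return 0.5, 0.5, 0.0, -0.5
    if scheme == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if scheme == "average":
        return n_i / (n_i + n_j), n_j / (n_i + n_j), 0.0, 0.0
    raise ValueError(scheme)

a_i, a_j, b, g = coefficients("average", n_i=3, n_j=2)
print(lance_williams(1.0, 2.0, 1.5, a_i, a_j, b, g))  # -> 1.4
```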

  • Antonio Neme and Omar Neme. Vote buying detection via independent component analysis

Abstract: Electoral fraud can be committed at several stages. Different tools have been applied to detect the existence of such undesired actions. One particular undesired activity is vote-buying. It can be thought of as an economic influence of a candidate over voters who in other circumstances could have decided to vote for a different candidate, or not to vote at all. Instead, under this influence, some citizens cast their votes for the suspicious candidate. We propose in this contribution that intelligent data analysis tools can help in the identification of this undesired behavior. We think of the results obtained in the affected ballots as a mixture of two signals. The first signal is the number of votes for the suspicious candidate, which includes his/her actual supporters and the voters affected by an economic influence. The second mixed signal is the number of citizens who did not vote, which is also affected by the bribes or economic incentives. These assumptions allow us to apply an instance of blind source separation, independent component analysis, in order to reconstruct the original signals, namely, the actual number of voters the candidate may have had and the actual number of non-voters. As a case study, we applied the proposed methodology to the 2012 presidential election in Mexico, analyzing public data. Our results are consistent with findings of inconsistencies through other electoral forensic means.
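
The separation step described above can be prototyped with off-the-shelf ICA. The sketch below builds two synthetic mixtures (votes for the candidate, abstentions) from latent "genuine support" and "bought votes" signals and unmixes them with scikit-learn's FastICA; the data and variable names are hypothetical stand-ins for the Mexican polling-station counts.

```python
# Illustrative blind source separation of synthetic electoral signals.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n_stations = 2000
genuine_support = rng.gamma(2.0, 50.0, n_stations)  # latent signal 1
bought_votes = rng.gamma(1.5, 20.0, n_stations)     # latent signal 2

# Observed mixtures per polling station (hypothetical construction).
votes_for_candidate = genuine_support + bought_votes
abstentions = rng.gamma(2.0, 40.0, n_stations) - bought_votes

X = np.column_stack([votes_for_candidate, abstentions])
sources = FastICA(n_components=2, random_state=0).fit_transform(X)
print(sources.shape)   # (2000, 2): estimated latent signals per station
```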

  • Eduardo Kamioka, Frederico Caroli, André Freitas and Siegfried Handschuh. Determining Data Relevance using Semantic Types and Graphical Interpretation Cues

Abstract: The increasing volume of data generated and the shortage of professionals trained to extract value from it raise the question of how to automate data analysis processes. This work investigates how to increase automation in the data interpretation process by proposing a relevance classification heuristic model, which can be used to express which views over the data are potentially meaningful and relevant. The relevance classification model uses the combination of semantic types derived from the data attributes and visual human interpretation cues as input features. The evaluation shows the impact of these features in improving the prediction of data relevance, where the best classification model achieves an F1 score of 0.906.

Regular Posters


  • Daoyuan Li, Tegawendé F. Bissyandé, Jacques Klein and Yves Le Traon. DSCo-NG: A Practical Language Modeling Approach for Time Series Classification

Abstract: The abundance of time series data in various domains and their high dimensionality are challenging for harvesting useful information from them. To tackle storage and processing challenges, compression-based techniques have been proposed. Our previous work, Domain Series Corpus (DSCo), compresses time series into symbolic strings and takes advantage of language modeling techniques to extract from the training set knowledge about different classes. However, this approach was flawed in practice due to its excessive memory usage and the need for a priori knowledge about the dataset. In this paper we propose DSCo-NG, which reduces DSCo’s complexity and offers an efficient (linear time complexity and low memory footprint), accurate (performance comparable to approaches working on uncompressed data) and generic (applicable to various domains) approach to time series classification. Our confidence is backed by extensive experimental evaluation against publicly accessible datasets, which also offers insights into when DSCo-NG can be a better choice than others.

  • Livia Teernstra, Peter van der Putten, Liesbeth Noordegraaf-Eelens and Fons Verbeek. The Morality Machine: Tracking Moral Values in Tweets

Abstract: This paper introduces The Morality Machine, a system that tracks ethical sentiment in Twitter discussions. Empirical approaches to ethics are rare, and to our knowledge this system is the first to take a machine learning approach. It is based on Moral Foundations Theory, a framework of moral values that are assumed to be universal. Carefully handcrafted keyword dictionaries for Moral Foundations Theory exist, but experiments show that models that do not leverage these have similar or superior performance, thus demonstrating the value of a purer machine learning approach.

  • Mouna Ben Ishak, Philippe Leray and Nahla Ben Amor. A hybrid approach for Probabilistic Relational Models structure learning

Abstract: Probabilistic relational models (PRMs) extend Bayesian networks (BNs) to a relational data mining context. Just like BNs, the structure and parameters of a PRM must be either set by an expert or learned from data. Learning the structure remains the most complicated issue, as it is an NP-hard problem. Existing approaches for PRM structure learning are inspired by classical methods for learning BN structure. Extensions of the constraint-based and score-based methods have been proposed. However, hybrid methods are not yet adapted to relational domains, although some of them, such as the Max-Min Hill Climbing (MMHC) algorithm, show better experimental performance in the classical context than constraint-based and score-based methods. In this paper, we present an adaptation of the latter to relational domains and provide an empirical evaluation of our algorithm. We report an experimental study where we compare our new approach to state-of-the-art relational structure learning algorithms.

  • Deepak Soekhoe, Peter van der Putten and Aske Plaat. On the Impact of Data Set Size in Transfer Learning using Deep Neural Networks

Abstract: In this paper we study the effect of target set size on transfer learning in deep convolutional neural networks. This is an important problem, as labelling is a costly task, and for new or specific classes the number of labelled instances available may simply be too small. We present results for a series of experiments where we either train on a target set of classes from scratch, retrain all layers, or successively lock more layers in the network, for the Tiny-ImageNet and MiniPlaces2 data sets. Our findings indicate that for smaller target data sets, freezing the weights of the initial layers of the network gives better results on the target set classes. We present a simple and easy-to-implement training heuristic based on these findings.
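
The core of the heuristic, freezing early layers when the target set is small, is easy to express in a modern framework. The PyTorch sketch below is an assumption-laden illustration: a torchvision ResNet-18 and a hypothetical 200-class target task stand in for the paper's networks and data sets.

```python
# Freeze early layers of a pretrained CNN, fine-tune the rest (illustrative).
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pretrained stand-in

# Early layers carry generic features; for small target sets the paper
# finds that keeping them fixed works best.
for block in [model.conv1, model.bn1, model.layer1, model.layer2]:
    for p in block.parameters():
        p.requires_grad = False

num_target_classes = 200                 # hypothetical target task
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable, "trainable parameters")
```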

  • Christian Braune, Marco Dankel, and Rudolf Kruse. Obtaining Shape Descriptors from a Concave Hull-Based Clustering Algorithm

Abstract: In data analysis clustering is one of the core processes to find groups in otherwise unstructured data. Determining the number of clusters or finding clusters of arbitrary shape whose convex hulls overlap is in general a hard problem. In this paper we present a method for clustering data points by iteratively shrinking the convex hull of the data set. Subdividing the created hulls leads to shape descriptors of the individual clusters. We tested our algorithm on several data sets and achieved high degrees of accuracy. The cluster definition employed uses a notion of spatial separation. We also compare our algorithm against a similar algorithm that automatically detects the boundaries and the number of clusters. The experiments show that our algorithm yields better results.

  • Fabio Del Vigna, Marco Avvenuti, Clara Bacciu, Paolo Deluca, Marinella Petrocchi, Andrea Marchetti and Maurizio Tesconi. Spotting the diffusion of New Psychoactive Substances over the Internet

Abstract: The online availability and diffusion of Novel Psychoactive Substances (NPS) represent an emerging threat to healthcare systems. In this work, we analyse drug forums, online shops, and Twitter. By mining data from these sources, it is possible to understand the dynamics of drug diffusion and endorsement, as well as to detect new substances in a timely manner. We propose a set of visual analytics tools to support analysts in tackling NPS spreading and to provide better insight into the drugs market.

  • Gianni Costa and Riccardo Ortale. A Mean-Field Variational Bayesian Approach to Detecting Overlapping Communities with Inner Roles using Poisson Link Generation

Abstract: A novel model-based machine-learning approach is presented for the unsupervised and exploratory analysis of node affiliations to overlapping communities with roles in networks. At the heart of our approach is a new Bayesian probabilistic generative model of directed networks that treats roles as abstract behavioral classes explaining node linking behavior. A generalized weighted instance of directed affiliation modeling rules the strength of node participation in communities with whichever role through Gamma priors. Moreover, link establishment between nodes is governed by a Poisson distribution. The latter is parameterized so that the stronger the affiliations of two nodes to common communities with respective roles, the more likely the formation of a connection. A coordinate-ascent algorithm is designed to implement mean-field variational inference for affiliation analysis and link prediction. A comparative evaluation on real-world networks demonstrates the superiority of our approach in community compactness, link prediction and scalability.

  • Ricardo Sousa and João Gama. Online semi-supervised learning for multi-target regression in data streams using AMRules

Abstract: Most data stream systems that use online multi-target regression yield vast amounts of data for which targets are unavailable. Targeting this data is usually impossible, time-consuming and expensive. Semi-supervised algorithms have been proposed to use this untargeted data (input information only) for model improvement. However, most such algorithms are designed to work in batch mode for classification and require huge computational and memory resources. Therefore, this paper proposes a semi-supervised algorithm for online processing systems, based on the AMRules algorithm, that handles both targeted and untargeted data and improves the regression model. The proposed method was evaluated by comparing a scenario where the untargeted examples are not used in training with a scenario where some untargeted examples are used. Evaluation results indicate that using the untargeted examples improved the target predictions by improving the model.

  • Jim O’Donoghue and Mark Roantree. A Toolkit for Analysis of Deep Learning Experiments

Abstract: Learning experiments are complex procedures which generate high volumes of data due to the number of updates which occur during training and the number of trials necessary for hyper-parameter selection. Often during runtime, interim result data is purged as the experiment progresses. This purge makes rolling-back to interim experiments, restarting at a specific point or discovering trends and patterns in parameters, hyper-parameters or results almost impossible given a large experiment or experiment set. In this research, we present a data model which captures all aspects of a deep learning experiment and through an application programming interface provides a simple means of storing, retrieving and analysing parameter settings and interim results at any point in the experiment. This has the further benefit of a high level of interoperability and sharing across machine learning researchers who can use the model and its interface for data management.

  • Pedro Saleiro and Carlos Soares. Learning from the News: Predicting Entity Popularity on Twitter

Abstract: Every day, millions of tweets are generated about global and local news, including people’s reactions and opinions regarding the events covered in those news stories. Entities play a central role in the interplay between social networks and online news. When sharing or commenting on news on social networks, users tend to mention the most prominent entities in the news story. Therefore, entity popularity on social networks is an important metric for online reputation monitoring systems. In this work, we tackle the problem of predicting future entity popularity on Twitter by relying solely on information extracted from the news cycle. We apply a supervised learning approach and extract four types of features: (i) signal, (ii) textual, (iii) sentiment and (iv) semantic, which we use to predict whether the popularity of a given entity will be high or low in the following hours. We ran several experiments on six different entities in a dataset of over 150M tweets and 5M news articles and obtained F1 scores over 0.70. Error analysis shows that news performs better at predicting entity popularity on Twitter when serving as the primary information source for the event, as opposed to events such as TV live broadcasts, political debates or football matches.

  • Sami Dhahbi, Walid Barhoumi and Ezzeddine Zagrouba. Multi-scale kernel PCA and its application to curvelet-based feature extraction for mammographic mass characterization

Abstract: Accurate characterization of mammographic masses plays a key role in effective mammogram classification and retrieval. Because of their high performance in multi-resolution texture analysis, several curvelet-based features have been proposed to describe mammograms, but without satisfactory results in distinguishing between malignant and benign masses. This paper tackles the problem of extracting a reduced set of discriminative curvelet texture features for mammographic mass characterization. The contribution of this paper is twofold. First, to overcome the weakness of PCA in coping with the nonlinearity of curvelet coefficient distributions, we investigate the use of kernel principal component analysis (KPCA) with a Gaussian kernel over curvelet coefficients for mammogram characterization. Second, a new multi-scale Gaussian kernel is introduced to overcome the shortcomings of single Gaussian kernels. Indeed, given that faraway points may contain useful information for mammogram characterization, the kernel must emphasize neighboring points without neglecting faraway ones. Gaussian kernels either fail to emphasize the neighborhood (high sigma values) or ignore faraway points (low sigma values). To emphasize the neighborhood without neglecting faraway points, we propose to use a linear combination of Gaussian kernels with several sigma values as the kernel in KPCA. Experiments performed on the DDSM database showed that KPCA outperforms state-of-the-art curvelet-based methods, including PCA and moments, and that the multi-scale Gaussian kernel outperforms single Gaussian kernels.
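
The multi-scale kernel proposed above is a linear combination of Gaussian kernels with different bandwidths. The sketch below plugs such a combined Gram matrix into scikit-learn's KernelPCA; the sigma values, weights, and random features are illustrative placeholders for the tuned curvelet-coefficient setup of the paper.

```python
# Multi-scale Gaussian kernel: a weighted sum of RBF kernels (illustrative).
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import rbf_kernel

def multiscale_gaussian_gram(X, sigmas=(0.5, 1.0, 2.0, 4.0), weights=None):
    weights = weights or [1.0 / len(sigmas)] * len(sigmas)
    return sum(w * rbf_kernel(X, gamma=1.0 / (2 * s ** 2))
               for w, s in zip(weights, sigmas))

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 64))       # stand-in for curvelet features

K = multiscale_gaussian_gram(X)
kpca = KernelPCA(n_components=10, kernel="precomputed")
Z = kpca.fit_transform(K)                # reduced feature representation
print(Z.shape)
```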

  • Pierre Holat, Nadi Tomeh, Thierry Charnois, Delphine Battistelli, Marie-Christine Jaulent and Jean-Philippe Métivier. Weakly-supervised Symptom Recognition for Rare Diseases in Biomedical Text

Abstract: In this paper, we tackle the issue of symptom recognition for rare diseases in biomedical texts. Symptoms typically have a more complex and ambiguous structure than other biomedical named entities. Furthermore, existing resources are scarce and incomplete. Therefore, we propose a weakly-supervised framework based on a combination of two approaches: sequential pattern mining under constraints and sequence labeling. We use unannotated biomedical paper abstracts with dictionaries of rare diseases and symptoms to create our training data. Our experiments show that both approaches outperform simple projection of the dictionaries on text, and that their combination is beneficial. We also introduce a novel pattern mining constraint based on semantic similarity between words inside patterns.

  • Oliver Sampson and Michael Berthold. Widened Learning of Bayesian Network Classifiers

Abstract: We demonstrate the application of Widening to learning performant Bayesian Networks for use as classifiers. Widening is a framework for utilizing parallel resources and diversity to find models in a solution space that are potentially better than a standard greedy algorithm. This work demonstrates that widened learning of Bayesian Networks, using the Frobenius Norm of the networks’ graph Laplacian matrices as a distance measure, can create Bayesian networks that are better classifiers than those generated by popular Bayesian Network algorithms.
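
The distance measure named above, the Frobenius norm of the difference between two networks' graph Laplacians, is compact enough to sketch directly. The adjacency matrices below are hypothetical four-node DAG skeletons, not networks from the paper.

```python
# Frobenius-norm distance between graph Laplacians (L = D - A), illustrative.
import numpy as np

def laplacian(adj):
    return np.diag(adj.sum(axis=1)) - adj

def laplacian_distance(adj_a, adj_b):
    return np.linalg.norm(laplacian(adj_a) - laplacian(adj_b), ord="fro")

a = np.array([[0, 1, 0, 0],      # hypothetical DAG skeleton 1
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])
b = np.array([[0, 1, 1, 0],      # hypothetical DAG skeleton 2
              [0, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])
print(laplacian_distance(a, b))
```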

  • Josenildo Silva, Matthias Klusch and Stefano Lodi. Privacy-Awareness of Distributed Data Clustering Algorithms Revisited

Abstract: Several privacy measures have been proposed in the privacy-preserving data mining literature. However, these measures either assume a centralized data source or assume that no insider will try to infer information. This paper presents distributed privacy measures that take into account collusion attacks and point-level breaches for distributed data clustering. An analysis of representative distributed data clustering algorithms shows that collusion is an important source of privacy issues and that the analyzed algorithms exhibit different vulnerabilities to collusion groups.

  • Lazhar Labiod and Mohamed Nadif. Bi-stochastic matrix approximation framework for data co-clustering

Abstract: Matrix approximation approaches such as Singular Value Decomposition (SVD) and Non-negative Matrix Tri-Factorization (NMTF) have recently been shown to be useful and effective in tackling the co-clustering problem. In this work, we embed co-clustering in a Bi-stochastic Matrix Approximation (BMA) framework and derive from the double k-means objective function a new formulation of the criterion to optimize. First, we show that double k-means is equivalent to an algebraic problem of BMA under suitable constraints. Second, we propose an iterative process seeking the optimal simultaneous partitions of rows and columns; the solution is given as the steady state of a Markov chain process. We develop two iterative algorithms: the first learns rows and columns similarity matrices, and the second obtains the simultaneous rows and columns partitions. Numerical experiments on simulated and real datasets demonstrate the interest of our approach, which does not require knowledge of the number of co-clusters.

  • Karel Vaculik and Lubos Popelinsky. DGRMiner: Anomaly Detection and Explanation in Dynamic Graphs

Abstract: Ubiquitous network data has given rise to diverse graph mining and analytical methods. One of the graph mining domains is anomaly detection in dynamic graphs, which can be employed for fraud detection, network intrusion detection, suspicious behaviour identification, etc. Most existing methods search for anomalies at the global level of the graphs. In this work, we propose a new anomaly detection and explanation algorithm for dynamic graphs. The algorithm searches for anomaly patterns in the form of predictive rules that enable us to examine the evolution of dynamic graphs at the level of subgraphs. Specifically, these patterns are able to capture the addition and deletion of vertices and edges, and the relabeling of vertices and edges. In addition, the algorithm outputs normal patterns that serve as an explanation for the anomaly patterns. The algorithm has been evaluated on two real-world datasets.

  • Frank Klawonn, Junxi Wang, Ina Koch, Jörg Eberhard and Mohamed Omar. HAUCA Curves for the Evaluation of Biomarker Pilot Studies with Small Sample Sizes and Large Numbers of Features

Abstract: Biomarker studies often try to identify a combination of measured attributes to support the diagnosis of a specific disease. Measured values are commonly obtained from high-throughput technologies like next-generation sequencing, leading to an abundance of biomarker candidates compared to the often very small sample size. Here we use an example with more than 50,000 biomarker candidates that we want to evaluate based on a sample of only 24 patients. This seems an impossible task, and finding purely random correlations is guaranteed. Although we cannot identify specific biomarkers in such small pilot studies with purely statistical methods, one can still determine whether there are more biomarkers showing a high correlation with the disease under consideration than one would expect in a setting where correlations are purely random. We propose a method based on area under the ROC curve (AUC) values that indicates how much the correlations of the biomarkers with the disease of interest exceed pure random effects. We also provide estimates of sample sizes for follow-up studies to actually identify concrete biomarkers and build classifiers for the disease. We also describe how our method can be extended to performance measures other than AUC.
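
The comparison at the heart of the method, observed per-biomarker AUC values against what chance alone would produce, can be mimicked in a few lines. The sketch below uses synthetic data with a handful of planted markers and a label permutation as the chance reference; it illustrates the idea only, not the HAUCA curve construction itself.

```python
# Per-feature AUCs on a tiny sample vs. a permutation-based chance reference.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_patients, n_features = 24, 5_000       # scaled down from >50,000 for speed
X = rng.standard_normal((n_patients, n_features))
y = rng.integers(0, 2, n_patients)
X[:, :25] += y[:, None] * 1.5            # plant a few genuine biomarkers

aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(n_features)])
y_perm = rng.permutation(y)              # labels shuffled: pure chance
null = np.array([roc_auc_score(y_perm, X[:, j]) for j in range(n_features)])

for t in (0.8, 0.9):
    print(f"AUC > {t}: observed {(aucs > t).mean():.3%}, "
          f"chance {(null > t).mean():.3%}")
```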


Industrial Challenge Papers


  • Camila Ferreira Costa and Mario A. Nascimento. Using Machine Learning for Predicting Failures

Abstract: This paper presents solutions to the IDA 2016 Industrial Challenge, which consists of designing a model that is able to predict whether a specific component of the Air Pressure System of a vehicle faces imminent failure. This is modelled as a classification problem, since the goal is to determine whether an unobserved instance represents a failure or not. We evaluate various state-of-the-art classification algorithms and investigate how to deal with the imbalanced dataset and with the high amount of missing data. Our experiments showed that, in terms of cost, the best classifier was 92.56% better than a baseline solution that classifies at random.

  • Vitor Cerqueira, Fábio Pinto, Cláudio Sá and Carlos Soares. Combining Boosted Trees with Metafeature Engineering for Predictive Maintenance

Abstract: We describe a data mining workflow for predictive maintenance of the Air Pressure System in heavy trucks. Our approach is composed of four steps: (i) a filter that excludes a subset of features and examples based on the number of missing values; (ii) a metafeature engineering procedure used to create a meta-level feature set with the goal of increasing the information in the original data; (iii) a biased sampling method to deal with the class imbalance problem; and (iv) boosted trees to learn the target concept. Results show that the metafeature engineering and the biased sampling method are critical for improving the performance of the classifier.

  • Ezgi Can Ozan, Ekaterina Riabchenko, Serkan Kiranyaz and Moncef Gabbouj. An Optimized k-NN Approach for Classification on Imbalanced Datasets with Missing Data

Abstract: In this paper, we describe our solution for the machine learning prediction challenge in IDA 2016. For the given problem of 2-class classification on an imbalanced dataset with missing data, we first develop an imputation method based on k-NN to estimate the missing values. Then we define a tailored representation for the given problem as an optimization scheme, which consists of learned distance and voting weights for k-NN classification. The proposed solution performs better in terms of the given challenge metric compared to the traditional classification methods such as SVM, AdaBoost or Random Forests.
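
The two ingredients named above, k-NN imputation followed by weighted k-NN classification, can be approximated with scikit-learn defaults, as in the sketch below. The paper learns its own distance and voting weights, so this is only a baseline-shaped illustration on synthetic data.

```python
# k-NN imputation + distance-weighted k-NN classification (illustrative).
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic 2-class target
X[rng.random(X.shape) < 0.2] = np.nan     # 20% of entries go missing

X_imp = KNNImputer(n_neighbors=5).fit_transform(X)
clf = KNeighborsClassifier(n_neighbors=7, weights="distance").fit(X_imp, y)
print(clf.score(X_imp, y))
```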


Horizon Track Talks


  • Devdatt Dubhashi. Cognitive Computing for the Automated Society

Abstract: Future projections at the recent World Economic Forum in Davos and at the Almedalen seminars in Sweden have focussed on the promises and challenges of the fast-approaching automated society. Robots and other autonomous systems are taking over large segments of the economy and society. Autonomous driving and so-called Manufacturing 4.0 are already making a visible impact. One of the key technologies that will shape this future is cognitive computing, i.e. “smart systems that learn at scale, reason with purpose and interact with humans naturally”, enabling seamless human-computer interaction and collaboration. A central component of cognitive computing systems that will be ubiquitous in all application domains is natural language understanding. Recent advances in machine learning have brought dramatic progress in natural language technologies. These involve multi-modal methods that integrate and analyze heterogeneous data sources, including text, image and dialogue. Deep learning in particular has achieved remarkable success in these areas. We will give an overview of the state of the art in these technologies and give examples from our own research.

  • Daniel Gillblad. Usable analytics at societal scale

Abstract: While we are rapidly deploying increasingly advanced analytics and machine learning techniques on larger and larger data sets, the development efforts needed along with the scale and complexity of solutions often severely limit the actual usefulness of the analytics results. In this talk, we will outline what we at SICS, the Swedish Institute of Computer Science, see as the most important research areas and directions to develop useful, practical analytics solutions at societal scale, from computational models and platforms to the machine learning and AI technologies running on top of them. We will take examples from current research on computational frameworks, machine learning, and their applications.