Show simple item record

dc.contributor.author: Bandyapadhyay, Sayan
dc.contributor.author: Fomin, Fedor
dc.contributor.author: Golovach, Petr
dc.contributor.author: Simonov, Kirill
dc.date.accessioned: 2022-01-28T08:43:30Z
dc.date.available: 2022-01-28T08:43:30Z
dc.date.created: 2022-01-06T15:09:25Z
dc.date.issued: 2021
dc.identifier.issn: 1868-8969
dc.identifier.uri: https://hdl.handle.net/11250/2928922
dc.description.abstract: We develop new algorithmic methods with provable guarantees for feature selection in categorical data clustering. While feature selection is one of the most common approaches to reducing dimensionality in practice, most known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features such that there is a small-cost clustering on the selected features. More precisely, for given integers l (the number of irrelevant features) and k (the number of clusters), a budget B, and a set of n categorical data points (represented by m-dimensional vectors whose elements belong to a finite set of values Σ), we want to select m-l relevant features such that the cost of an optimal k-clustering on these features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (ℓ0-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on the parameters k, B, and |Σ|. Our main result is an algorithm that solves the Feature Selection problem in time f(k,B,|Σ|)⋅m^{g(k,|Σ|)}⋅n² for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when |Σ| and k are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points can be clustered around centers satisfying specific constraints. One interesting fact about Constrained Clustering with Outliers is that, besides Feature Selection, it encompasses many other fundamental problems regarding categorical data, such as Robust Clustering, Binary and Boolean Low-rank Matrix Approximation with Outliers, and Binary Robust Projective Clustering. Thus, as a byproduct of our theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds. [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Schloss Dagstuhl, Leibniz-Zentrum für Informatik [en_US]
dc.relation.uri: https://arxiv.org/abs/2105.03753
dc.rights: Attribution 4.0 International (Navngivelse 4.0 Internasjonal)
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/deed.no
dc.title: Parameterized Complexity of Feature Selection for Categorical Data Clustering [en_US]
dc.type: Journal article [en_US]
dc.type: Peer reviewed [en_US]
dc.description.version: publishedVersion [en_US]
dc.rights.holder: Copyright Sayan Bandyapadhyay, Fedor V. Fomin, Petr A. Golovach, and Kirill Simonov [en_US]
dc.source.articlenumber: 14 [en_US]
cristin.ispublished: true
cristin.fulltext: original
cristin.qualitycode: 1
dc.identifier.doi: 10.4230/LIPIcs.MFCS.2021.14
dc.identifier.cristin: 1976054
dc.source.journal: Leibniz International Proceedings in Informatics [en_US]
dc.relation.project: Norges forskningsråd: 263317 [en_US]
dc.identifier.citation: Leibniz International Proceedings in Informatics. 2021, 202, 14. [en_US]
dc.source.volume: 202 [en_US]
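
To make the cost model in the abstract concrete, here is a minimal Python sketch (not the authors' algorithm; all names here, such as majority_center and clustering_cost, are hypothetical illustrations). It computes the clustering cost defined above: given points over a finite alphabet Σ, a set of selected features, a partition into k clusters, and a center per cluster, the cost is the total Hamming distance over the selected coordinates only.

    from collections import Counter

    def hamming_on_selected(x, c, selected):
        # Hamming (l0) distance between point x and center c,
        # restricted to the selected feature coordinates.
        return sum(1 for j in selected if x[j] != c[j])

    def clustering_cost(points, clusters, centers, selected):
        # Total cost: sum over clusters of the Hamming distances
        # between each member point and its cluster's center.
        return sum(
            hamming_on_selected(points[i], centers[t], selected)
            for t, members in enumerate(clusters)
            for i in members
        )

    def majority_center(points, members, selected, m):
        # For a fixed cluster, an optimal unconstrained center takes
        # the most common value in each selected coordinate.
        center = [0] * m
        for j in selected:
            center[j] = Counter(points[i][j] for i in members).most_common(1)[0][0]
        return center

    # Toy instance: n=4 points, m=3 features over Σ={0,1}. Feature 2 is the
    # l=1 irrelevant feature, so we select the m-l=2 features {0,1} and use k=2.
    points = [(0, 0, 1), (0, 0, 0), (1, 1, 1), (1, 1, 0)]
    clusters = [[0, 1], [2, 3]]
    selected = [0, 1]
    centers = [majority_center(points, members, selected, 3) for members in clusters]
    print(clustering_cost(points, clusters, centers, selected))  # prints 0

On this toy instance the selected features admit a zero-cost 2-clustering, i.e., the budget constraint is met for any B ≥ 0, whereas keeping the irrelevant third feature would force a positive cost.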



Attribution 4.0 International (Navngivelse 4.0 Internasjonal)
Except where otherwise noted, this item is licensed under Attribution 4.0 International.