# Research

I contributed to several different fields during my PhD. My research interests are best described as applied machine learning and data mining. Refer to my list of publications for more details about the research that is briefly highlighted on this webpage.

## Subset Selection

An ever-increasing number of applications generate massive amounts of high-dimensional data. Unfortunately, not all of the features in the data are informative or meaningful, and we usually do not know in advance which features are meaningful. My research in subset selection has focused on the development of algorithms that can: (a) scale to massive data sets to meet today's needs, (b) infer how many features carry enough information to be ranked as important, and (c) be performed independently of classifier optimization. I developed *Neyman-Pearson Feature Selection* (NPFS), which achieves these tasks (see GitHub for the code and examples). My ongoing research focuses on sequential learning algorithms for subset selection that do not need to consider the entire feature set at any given round, which is particularly useful when your software cannot evaluate the entire feature set at the same time.
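To make goals (b) and (c) more concrete, here is a minimal Python sketch of one way a bootstrap-based selection test can be set up. It is an illustration under stated assumptions, not the released NPFS code: a base selector picks `k` features on each bootstrap sample, and a feature is kept only if it was picked more often than a binomial null model with success probability `k / n_features` would plausibly allow. The base selector (`top_k_by_mean_gap`), the toy data generator, all function names, and the default parameter values are hypothetical; see the GitHub repository for the actual implementation.

```python
import math
import random


def binomial_critical_value(n, p, alpha):
    """Smallest integer c such that P(Z > c) <= alpha for Z ~ Binomial(n, p)."""
    tail = 1.0
    for c in range(n + 1):
        tail -= math.comb(n, c) * p**c * (1 - p) ** (n - c)  # tail is now P(Z > c)
        if tail <= alpha:
            return c
    return n


def top_k_by_mean_gap(rows, labels, k):
    """Hypothetical base selector: rank features by |mean(class 1) - mean(class 0)|."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        if not pos or not neg:
            scores.append(0.0)  # degenerate bootstrap sample with a single class
            continue
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def bootstrap_selection_test(rows, labels, k, n_boot=50, alpha=0.01, seed=0):
    """Keep features selected more often than chance across bootstrap samples."""
    rng = random.Random(seed)
    n_samples, n_features = len(rows), len(rows[0])
    counts = [0] * n_features
    for _ in range(n_boot):
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        boot_rows = [rows[i] for i in idx]
        boot_labels = [labels[i] for i in idx]
        for j in top_k_by_mean_gap(boot_rows, boot_labels, k):
            counts[j] += 1
    # Null hypothesis: each feature is picked with probability k / n_features.
    critical = binomial_critical_value(n_boot, k / n_features, alpha)
    return [j for j in range(n_features) if counts[j] > critical]


def make_toy_data(n_samples=200, n_features=20, n_informative=2, shift=3.0, seed=1):
    """Toy two-class data: only the first n_informative features depend on the label."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n_samples):
        y = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += shift * y
        rows.append(row)
        labels.append(y)
    return rows, labels


if __name__ == "__main__":
    rows, labels = make_toy_data()
    print(bootstrap_selection_test(rows, labels, k=5))
```

Note that the number of features passing the test need not equal `k`, and the test only wraps the base selector without touching any classifier. That is the sense in which a procedure of this kind can suggest how many features look relevant while staying independent of classifier optimization.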

## Concept Drift and Class Imbalance

Two of the more common assumptions that applied machine learning researchers make are that (1) the training and testing data are sampled from a fixed, albeit unknown, probability distribution, and (2) there are an equal number of samples from all classes. A violation of the former is referred to as *concept drift* (a.k.a. learning in non-stationary environments), which arises when new data are presented over time, and a violation of the latter is known as *class imbalance*. I developed Learn++.NIE and Learn++.CDS, both classifier ensembles, to explicitly address these problems jointly, a setting that has been largely understudied in the literature. I have demonstrated that these approaches are useful in practice and that they outperform state-of-the-art algorithms on evaluation measures that go beyond a simple error calculation (see this paper).
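The following Python sketch shows why a simple error calculation can mislead when classes are imbalanced. Recall, F-measure, and the geometric mean of the per-class recalls are used here only as common examples of imbalance-aware measures. Which measures the paper reports is not stated on this page, so treat the choice of measures, the function names, and the toy labels as assumptions.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Accuracy alongside measures that account for the minority class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Toy stream chunk: 95 majority-class samples and 5 minority-class samples.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100
metrics = imbalance_metrics(y_true, y_pred)
print(metrics)
```

On this toy example the always-majority classifier has only a 5% error rate, yet its recall, F-measure, and geometric mean are all zero because it never detects the minority class.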

I have also studied the problem of semi-supervised and transductive learning when concept drift is present in a data stream. My algorithms leverage unlabeled data to tune the weights of classifiers for prediction. This form of data-driven decision making proved superior to state-of-the-art algorithms. I have also combined theory with practice to determine how a classifier should be weighted in data streams with concept drift.

I have released a MATLAB toolbox on GitHub for incremental learning algorithms.

## Comparative Metagenomics

Metagenomics is the study of genetic material obtained directly from an environmental sample, which means that everything in the sample (i.e., all of the organisms) is sequenced. This differs from traditional genomics, in which a single genome is generally sequenced. I have applied my expertise in subset selection to metagenomic data to help microbial ecologists determine the protein families or organisms that best differentiate between multiple phenotypes in a study. I have worked with Calvin Morrison to implement subset selection algorithms in QIIME, which is a heavily used analysis tool in microbial ecology.