IEEE SSCI in Orlando, FL

Matt Austen, Gail Rosen, Robi Polikar, and I have a paper in the IEEE Symposium on Computational Intelligence and Data Mining, which is part of the IEEE Symposium Series on Computational Intelligence. I’ll be in Orlando, FL to present the work and to participate in the Doctoral Consortium.

I am on the technical program committee for the special session on “Concept Drift, Domain Adaptation & Learning in Uncertain Environments” at the International Joint Conference on Neural Networks (held in Killarney, Ireland). You can find the call for papers here.

Performing Tasks Such As Monte Carlo Simulations on a Cluster

I have been using Drexel’s cluster more and more over the past few months, and several people have asked me for help with submitting jobs — in particular, how to run something like a Monte Carlo simulation. I figured I would share one way to go about it with Univa Grid Engine and a simple Matlab program.

First, we need a Matlab program that runs a single simulation. To make the results reproducible, the random seed needs to be set, and each simulation should get a different seed. To see why, open Matlab and run rand, then close Matlab and do it again. Notice anything interesting? Matlab starts from the same default seed every session, so without explicit seeding every task in our array job would produce the same “random” numbers. We therefore set the seed with an integer that we pass in through our SGE script. Our Matlab function sets the random seed, runs the simulation (here, just drawing a random number), and saves the results to an output file whose name is also passed in as an argument. The script looks like:

function matlab_demo(output_fp, seed)
  % set the random seed so each task produces a different, reproducible stream
  rng(seed);               % IMPORTANT
  a_random_number = rand;  % or do your simulation here
  save(output_fp);         % saves the workspace, including a_random_number
end
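
As a quick sanity check (the file names and seed values below are made up for illustration), you can call the function locally: running it twice with the same seed saves the identical “random” number, while a different seed gives a different one.

% hypothetical local test of matlab_demo
matlab_demo('run_a.mat', 1);   % same seed ...
matlab_demo('run_b.mat', 1);   % ... same result
matlab_demo('run_c.mat', 2);   % different seed, different result

a = load('run_a.mat'); b = load('run_b.mat'); c = load('run_c.mat');
disp(a.a_random_number == b.a_random_number)   % prints 1
disp(a.a_random_number == c.a_random_number)   % prints 0 (almost surely)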

Next, we need a shell script (I’ll call mine submitter.sh) to submit to grid engine. Nothing fancy here: make sure you have two slots set aside with pe shm, and set -t to the number of simulations you need to run. Grid engine assigns each task in the array job a task ID ($SGE_TASK_ID), which we can use to (i) set our random seed and (ii) give each output file a unique name. The other thing to watch out for is where you write your files. It may seem convenient to write the output file directly to your home directory; however, you should avoid this, particularly if you are writing several files in a single script. Instead, write to the scratch space, which is in $TMP, then copy the file from scratch back to your home directory.

#!/bin/bash -l
#$ -cwd
#$ -q all.q
#$ -t 1-50
#$ -j y
#$ -M your@email.edu
#$ -P yourGroupsPrj
#$ -S /bin/bash
#$ -pe shm 2
#$ -e /tmp/
#$ -o /tmp/
#$ -l h_vmem=3G
#$ -l h_rt=72:00:00

# boilerplate module loading
. /etc/profile.d/modules.sh
module load shared
module load proteus
module load sge/univa

# load the modules that your program is going to need
module load matlab/R2013a

# set the matlab path if you have other scripts you're going to need
export MATLABPATH=/path/to/your/matlab/files/

# set up a couple of environment variables
# - write a different file for each task in the array job (each task is
#   analogous to one MC simulation)
temp_fp=${TMP}/result_file_${SGE_TASK_ID}.mat
# - where we are going to write the result in our home directory
file_fp=/home/your_home/result_file_${SGE_TASK_ID}.mat

# call our matlab function; you may need to remove "-singleCompThread" if
# you're using the parallel computing toolbox
matlab -singleCompThread -nosplash -nodisplay -r "matlab_demo('${temp_fp}', ${SGE_TASK_ID})"

# now that the matlab program has saved the file to the scratch space,
# copy it back to our home directory.
cp ${temp_fp} ${file_fp}

Then in the shell:

qsub submitter.sh

And that’s about it! Now all you need to do is write a reduce script to summarize the results from the 50 simulations (a rough sketch is shown below). Always refer to the documentation if you need to know how to set the flags in your grid engine script.
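
For the reduce step, here is a minimal sketch in Matlab. It assumes the naming scheme and home-directory path used above, and it collects a_random_number from each of the 50 result files; replace that variable (and the summary statistics) with whatever your simulation actually produces.

function reducer()
  % gather the per-task results into a single vector
  n_tasks = 50;
  results = zeros(n_tasks, 1);
  for task_id = 1:n_tasks
    fp = sprintf('/home/your_home/result_file_%i.mat', task_id);
    data = load(fp);                        % load the saved workspace
    results(task_id) = data.a_random_number;
  end
  % summarize the Monte Carlo runs
  fprintf('mean: %f, std: %f\n', mean(results), std(results));
end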

ACM International Workshop on Big Data in Life Sciences

Gail Rosen and I have an invited talk at the ACM International Workshop on Big Data in Life Sciences (BigLS), which is being held in conjunction with the ACM BCB conference. I will be in Newport Beach, CA to give the talk. We have released most of the code required to reproduce the results on GitHub. Note that the shell script in the root of the repository is used by me to run IPython on our lab’s server. If you’re interested, you can find instructions on how to do this here.

Notes from the WCCI

I sat through several sessions and met some really great people at the WCCI in Beijing. First, Yann LeCun gave one of the best plenary lectures I have had the pleasure of sitting through. Yann gave a great overview of learning representations and convolutional neural networks, and finished off the talk with a really impressive demo. Here are a couple of notes that I took during some of the sessions.

  • Paul Werbos gave a great talk on where he sees some of the grand challenges. One of the themes of his talk was combining approaches from supervised, unsupervised, and reinforcement learning, which is an area of interest in his division at the NSF, as are improvements to the optimization of ADP problems.
    P. Werbos, “From ADP to the Brain: Foundations, roadmap, challenges and research priorities,” in International Joint Conference on Neural Networks, 2014.
  • S. Wang et al. presented an interesting multi-objective optimization problem to find the Pareto-optimal weights for OOB and UOB, such that the minority class and majority class recalls are simultaneously maximized.
    S. Wang, L. L. Minku, and X. Yao, “A multi-objective ensemble method for online class imbalance learning,” in International Joint Conference on Neural Networks, 2014.

Maybe I’ll update this later with more notes.

PhD Research Proposal Approved

My PhD research proposal was approved by my committee members on April 11, 2014. The general topic is to develop a computationally-efficient sequential learning framework — suitable for large scale or streaming data sets — that can determine the most relevant features for a user-defined objective function given no prior information. Some wrapper and embedded methods can select the most important features with little to no prior information; however, such methods must also learn the classifier parameters, which can be computationally burdensome or intractable for incremental learning or massive data sets. One of the goals of the proposed research is to develop a generalizable sequential learning subset selection (SLSS) framework that selects features most relevant for any objective function and can be paired with incremental learning algorithms. Such an approach has been largely under-explored and hence conspicuously missing in the literature despite an ever-increasing number of applications that desperately need fast computation and flexibility of optimization.

IJCNN Manuscript Accepted

A manuscript I wrote with my advisors, Gail Rosen and Robi Polikar, has been accepted to appear in the proceedings of the International Joint Conference on Neural Networks (IJCNN). This manuscript continues the work we wrote for CIDUE in 2013, which examined the loss of a multiple expert system (MES) making predictions under concept drift.

In this latest manuscript, we dive further into the analysis from the previous work and derive a tighter upper bound on the loss of the MES. We also provide experiments on real-world data streams that support the analysis.

I have released the code for this manuscript on my GitHub page.

On a slightly odd note, I was making a presentation on PCoA and PCA in Python, and I wanted to put a funny picture at the end of the talk (right before we open up the shell and start writing code). I couldn’t help but think of the XKCD post on Python. It turns out that you can import antigravity into Python! Give this a try:

import antigravity