Monday, January 31, 2005
Some helpful scripts
The script performs pairwise t-tests to compare multiple methods on several datasets. It uses the Bonferroni correction to adjust the significance level of the tests for the multiple comparisons. It assumes that the results are paired, like the results obtained, for instance, from a stratified ten-fold cross-validation test. The script reads a text file with the following format (a short made-up example is given below the format):
<dataset1> <method1> <numerical result 1>
<dataset1> <method1> <numerical result 2>
.
.
.
<dataset1> <method2> <numerical result 1>
.
.
.
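For instance, with two methods, three paired runs each, and entirely made-up numbers just to illustrate the format, the first few lines of such a file could look like:
--------------------------------------------------------------
dataset1 method1 0.83
dataset1 method1 0.85
dataset1 method1 0.79
dataset1 method2 0.78
dataset1 method2 0.81
dataset1 method2 0.80
--------------------------------------------------------------
The script itself is: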
--------------------------------------------------------------
data <- read.table("results_file.dat")   # V1 = dataset, V2 = method, V3 = result
for (dataset in levels(data$V1)) {
  print(sprintf("%s", dataset))
  data2 <- data[data$V1 == dataset, ]
  # Pairwise paired t-tests among methods, Bonferroni-adjusted p-values
  res <- pairwise.t.test(data2$V3, data2$V2, p.adj = "bonf",
                         pool.sd = FALSE, paired = TRUE)$p.value
  rows <- dimnames(res)[[1]]
  cols <- dimnames(res)[[2]]
  for (m1 in rows) {
    for (m2 in cols) {
      val <- res[m1, m2]
      if (!is.na(val) && val < 0.05) {
        ave1 <- mean(data2[data2$V2 == m1, ]$V3)
        ave2 <- mean(data2[data2$V2 == m2, ]$V3)
        if (ave1 > ave2) {
          print(sprintf("Sig %s %s -> %s", dataset, m1, m2))
        } else {
          print(sprintf("Sig %s %s -> %s", dataset, m2, m1))
        }
      }
    }
  }
}
--------------------------------------------------------------
The script prints a message for each significant difference observed in the results, using a 95% confidence level.
Another script I use very often is a Perl wrapper for gnuplot, the well-known command-line plotting program. The script generates draft plots just to check how the results look, without the overhead of launching other programs, like Matlab, that can generate fancier plots.
--------------------------------------------------------------
#!/usr/bin/perl -w
use strict;

# Open a pipe to gnuplot (the -persist flag keeps the plot window open)
open(FH, "|gnuplot -persist") or die "Can't exec gnuplot";
print FH "set encoding iso_8859_1\n";
print FH "set data style lines\n";

# Build a single "plot" command overlaying every data file given as an argument
my $end = 0;
my $index = 0;
my $line = "plot ";
while (not $end) {
    if (defined $ARGV[$index]) {
        $line .= "\"$ARGV[$index]\" with errorbars,";
        $index++;
    } else {
        chop $line;          # remove the trailing comma
        $line .= "\n";
        $end = 1;
    }
}
print FH $line;
close(FH);
------------------------------------------------------------
This script takes as arguments a variable number of data files to overlay in the same plot. Each file should contain a stream of lines with the following format:
<X value> <Y value> <Y max> <Y min>
The program plots the points specified by the data files, connected by lines and with error bars (given by <Y max> and <Y min>).
The plot is shown in the current display.
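If the results live in R (as in the t-test script above), a compatible data file can be written with a few lines like the following sketch; the data frame results with columns x and y, and the file name curve.dat, are hypothetical:
--------------------------------------------------------------
# "results" is a hypothetical data frame with columns x and y
# (several y measurements per x value, e.g. one per cross-validation fold)
m  <- tapply(results$y, results$x, mean)
hi <- tapply(results$y, results$x, max)
lo <- tapply(results$y, results$x, min)
write.table(data.frame(as.numeric(names(m)), m, hi, lo),
            file = "curve.dat", row.names = FALSE, col.names = FALSE)
--------------------------------------------------------------
Assuming the Perl wrapper is saved as, say, draftplot.pl (the name is mine), running ./draftplot.pl curve.dat other_curve.dat overlays the two curves in one gnuplot window.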
I hope this helps.
Unnatural selection article gets blog traction
Sunday, January 30, 2005
ISGEC to SIGEVO: From Outlaws to Mainstream?
To some extent it's a matter of bad timing. First-wave evolutionary computation researchers (and cyberneticists of all stripes) in the late 50s and 60s were just getting their sea legs when Minsky and Papert's premature and misleading hatchet job on neural networks appeared and took the wind out of cybernetic sails generally. Moreover, computers of the day weren't entirely up to the tasks put to them, and a number of prominent pioneering GA/EC works were prematurely dismissed as no big deal. Finally, the rise of symbolic AI and the subsequent cybernetics winter made it academically disreputable to persist in the "folly" of genetic algorithms and evolutionary computation, but persist the field did, and we are here today because of the courage of a small group who swam against the intellectual currents of the 60s, 70s, and 80s.
Another reason for the current situation is that human lifespans are no longer a good match to the turnover in intellectual ideas. This brings to mind the old story told of physicist Max Planck. When he was asked how the acceptance of his revolutionary ideas in quantum physics was going, it is reported that he said, "Wonderful, lots of funerals." In Planck's day, perhaps the natural turnover in faculty matched the turnover of ideas, but in our time, rapid change in thinking has not been matched by a concomitant shortening of faculty life expectancy. The result is large numbers of powerful faculty in charge with ideas that are more than a little behind the times. Unfortunately, the most obvious solutions to this problem are serious felonies, and no one here is suggesting that steps be taken to cull the herd.
So it is in this sense that the acceptance of SIGEVO by ACM couldn't come at a better time and should be seen as a very positive thing. With GECCO published as an ACM conference proceedings, young faculty can safely put their work there without endlessly defending their choice. With ACM-affiliated journals in EC and GP, department heads have the moral equivalent of a Good Housekeeping seal of approval on our field to rely upon instead of the pleas of a lone faculty member up for tenure. More importantly, if SIGEVO is seen as part of ACM and CS, it will become easier for CS department heads to hire faculty with GA/EC credentials.
Of course, department heads, when asked about tenure and promotion, will tell you that each case is special and is scrutinized without regard for brand names. Department heads have told me that the decision to move from ISGEC to SIGEVO will make absolutely no difference to them. But consumers will tell you that their decisions to buy products are also made on a case-by-case basis without regard for brands--as their grocery baskets fill with Crest, Tide, Mach 3, Ragu, Pepsi, and Twinkies. In a busy world, trusted brands allow buyers to get quality products at low risk of error and low search costs. In the new world of ACM and SIGEVO, CS departments and their heads will be able to rely on a trusted brand in one of the most important decisions they make: the hiring, retention, and promotion of faculty.
It will take some time to know the overall effect, and indeed, joining a large bureaucratic organization like ACM will come with its share of constraints and costs. But, in the end, I believe the young people in the PhD pipeline will benefit immensely through a higher probability of being hired, better chances at tenure, and improved prospects of funding as the once outlaw field of genetic algorithms and evolutionary computation comes in off the range, hangs up its six-shooter, and becomes a law-abiding denizen of the mainstream of computer science.
Saturday, January 29, 2005
"Computer sentience is possible," says Holland
"Computer sentience is possible," said John Holland, professor of electrical engineering and computer science and professor of psychology at the University of Michigan. "But for a number of reasons, I don't believe that we are anywhere near that stage right now."Readers of this blog need no introduction to Holland, but they might be somewhat surprised by his views regarding the possibility of computers rivaling human intelligence.
According to Holland, the problem with developing artificial intelligence through things like genetic algorithms is that researchers don't yet understand how to define what computer programs should be evolving toward. Human beings did not evolve to be intelligent--they evolved to survive. Intelligence was just one of many traits that human beings exploited to increase their odds of survival, and the test for survival was absolute. Defining an equivalent test of fitness for targeting intelligence as an evolutionary goal for machines, however, has been elusive. Thus, it is difficult to draw comparisons between how human intelligence developed and how artificial intelligence could evolve.
Hat tip to Open Source Protein Structure Prediction for picking up the news release and for republishing Cosma Shalizi's lovely review of Holland's Emergence.
"We don't understand enough about how our own human software works to come even close to replicating it on a computer," says Holland.
According to Holland, advances in software have not kept pace with the exponential improvements in hardware processing power, and there are many artificial intelligence problems that cannot be solved by simply performing more calculations. While hardware performance continues to double almost every year and a half, the doubling time for software performance is at least 20 years.
"In the final analysis, hardware is just a way of executing programs," says Holland. "It's the software that counts."
Canadian researchers apply GAs in mining
Friday, January 28, 2005
Are competent learning classifier systems emerging?
Early efforts by Sierra, Jiménez, Inza, Larrañaga, and Muruzábal showed that estimation of distribution algorithms (EDAs) may also be used in genetics-based machine learning approaches. They built a simple Pittsburgh-style classifier system based on EDAs. Recently, however, such approaches have reached the mainstream of LCS research: XCS.
The use of EDAs can help address two major issues in XCS: (1) knowledge extraction and (2) structure identification. Knowledge extraction addresses the issue of mining problem knowledge from the final solution developed by XCS. The goal is to identify the most important features in the problem and the dependencies among those features. The extracted knowledge may not only be used for further data mining, but may actually be fed back into the system, giving it further competence in solving problems in which dependent features, that is, building blocks, need to be processed effectively. A paper proposing to extract a feature dependency tree out of the rule-based problem representation developed by XCS may be found here.
Green GAs and the random keys guy
J-H Chen PhD takes top prize in Taiwan
Thursday, January 27, 2005
Data mining tools
The first one I started using was WEKA. It has a nice collection of classification, regression, and clustering algorithms. Written in Java, it is easy to use, providing a flexible environment for rapid preliminary filtering and analysis of raw data. Recently, I have noticed that more than 20 different projects already build on this framework.
Lately, I have moved from WEKA to D2K, a data mining framework developed by the Automated Learning Group at the National Center for Supercomputing Applications. It is, again, pure Java. The thing I like most about D2K, and one of the reasons for switching, is the data-flow-oriented paradigm it uses. Using an intuitive graphical editor, complicated analysis and visualization tasks can be rapidly deployed by simple drag & drop. I have been heavily using D2K in the DISCUS project, and I have no regrets about not using WEKA anymore. I have only good words for D2K's quality and for the effort the ALG people have put into making a great package that is easy to extend and customize.
Another tool I want to mention is a pretty specialized library. LIBSVM is integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR), and distribution estimation (one-class SVM). The authors provide sources in C++, Java, and C# .NET, and interfaces to Python, R, Matlab, Perl, and Ruby. I have been using it in some of my recent research, and if you are interested in such areas, I definitely recommend taking a look at it.
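As a quick illustration, here is a minimal C-SVC sketch from within R using the e1071 package, which wraps LIBSVM (the iris data set ships with R; the parameter values are arbitrary):
--------------------------------------------------------------
# Minimal C-SVC sketch via the e1071 package (an R interface to LIBSVM)
library(e1071)
data(iris)
model <- svm(Species ~ ., data = iris, type = "C-classification",
             kernel = "radial", cost = 1)   # arbitrary parameters
pred <- predict(model, iris[, -5])
table(pred, iris$Species)                   # confusion matrix on the training data
--------------------------------------------------------------
For quick experiments outside R, the svm-train and svm-predict command-line tools that ship with LIBSVM do the same job on files in LIBSVM's sparse format.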
And this leads me to one of my favorite tools, R. R is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. Contributed packages include all sorts of tools; I just want to point out the project on graphical models. I would recommend R to anyone who wants to speed up the analysis of the results that his or her GAs generate.
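As a tiny example of that kind of speed-up, the per-method spread in the results file used by the t-test script above can be eyeballed with a couple of lines (the file name and the dataset1 label are the placeholders from that post):
--------------------------------------------------------------
# Boxplot of the result column (V3) per method (V2) for one dataset
data <- read.table("results_file.dat")
boxplot(V3 ~ V2, data = data[data$V1 == "dataset1", ],
        xlab = "method", ylab = "result")
--------------------------------------------------------------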
GA-Based System saves lives in Iraq
Milligan thought it was possible, and on Nov. 17, 2003, BBN signed the contract with DARPA. The company's researchers knew that the physics was straightforward. Like supersonic airplanes, bullets create shock waves -- mini-sonic booms -- as they speed through the air. So if engineers arrange seven microphones like the spines of a sea urchin, a shock wave from a bullet will hit each microphone at a slightly different time, like a wave lapping against different pebbles on a beach. By measuring those time differences, it's possible to calculate the trajectory.
This story is fine as far as it goes, but as GA aficionados well know, the BBN effort was supported by a crack team of card-carrying genetic algorithmists. An article written by BBNers Hussain, Montana, Brinn & Cerys entitled Genetic algorithms for UGV navigation, sniper fire localization, and unit of action fuel distribution was presented last June at GECCO 2004 at the Workshop on Military and Security Applications of Evolutionary Computation (MSAEC-2004). The paper details the development of Boomerang and other GA-based military-security applications. These efforts are the result of longstanding GA work at BBN initiated by Dave Davis (now of NuTech Solutions) and carried on by Dave Montana and other hardworking GA types. It is difficult enough working in the obscure vineyards of GAs and EC, but when a group used to toiling in obscurity does something deserving of national recognition, it would be nice if the mainstream business press would get the story right and at least mention the names of the people who really got the job done.
As it turned out, "it was much harder than we thought it would be," says Milligan. Not only is the math difficult, but the system also had to work in the cacophony of urban warfare, including echoes from shots -- and do it on the move.
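For readers who like to see the geometry in code, here is a toy sketch in R, with invented positions and a bare-bones evolutionary search; it is in no way BBN's actual Boomerang algorithm, just the arrival-time-difference idea in miniature:
--------------------------------------------------------------
# Toy sketch: estimate a hypothetical 2-D sound-source position from
# arrival-time differences at seven microphones. All numbers are invented.
set.seed(1)
speed <- 343                                   # approx. speed of sound, m/s
mics  <- matrix(runif(14, -10, 10), ncol = 2)  # 7 microphone positions (m)
true.src <- c(30, -20)                         # hypothetical source position (m)

dists <- function(p) sqrt(rowSums((mics - matrix(p, nrow(mics), 2, byrow = TRUE))^2))
tdoa  <- function(p) { t <- dists(p) / speed; t - t[1] }  # time differences vs. mic 1
observed <- tdoa(true.src)                     # what the array would measure
fitness  <- function(p) sum((tdoa(p) - observed)^2)       # error to be minimized

pop <- matrix(runif(40, -100, 100), ncol = 2)  # 20 random candidate positions
for (gen in 1:300) {
  kids <- pop + matrix(rnorm(40, sd = 2), ncol = 2)       # Gaussian mutation
  pool <- rbind(pop, kids)
  pop  <- pool[order(apply(pool, 1, fitness))[1:20], , drop = FALSE]  # keep the best 20
}
print(pop[1, ])                                # best candidate found
--------------------------------------------------------------
The real problem is of course far harder: three dimensions, echoes, urban noise, and a moving vehicle, which is exactly where the difficulty Milligan describes comes from.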
This year's Genetic and Evolutionary Computation Conference (GECCO-2005) is 25-29 June 2005 in Washington DC. This year's Military and Security Applications of Evolutionary Computation Workshop (MSAEC-2005) is held at GECCO-2005 on Saturday, June 25, 2005.
Wednesday, January 26, 2005
Revisiting Pittsburgh
Yesterday, I was reading Jaume's paper on incremental learning again. A simple idea such as a round robin of small disjoint training sets greatly helps Pittsburgh classifier systems in the quest for generality. In the Michigan approach, however, this issue was settled by Wilson's idea of evolving classifiers based on accuracy (now celebrating the 10th anniversary of its publication).
The more I think about it, the more I wonder how Wilson's ideas, which gave birth to XCS, could be carried over to the Pittsburgh realm, where most of the work done has focused only on evolving rule sets based on overall performance (pure classification accuracy). If such ideas could be carried over, the basis for a renewed genetics-based machine learning paradigm would be sketched.
Marketing, interactive GAs, and the company that broke my heart
Over a number of years, I've followed research in iGAs, and some time ago I started to collect information about market research and product development with the idea of founding a company to do just that using interactive GA technology. My dreams of a great startup were dashed one semester when, during my genetic algorithms course, some students researching their class project uncovered a company that was already vigorously pursuing interactive GAs in marketing applications. That company, Affinova, has married iGAs with practical market research and product development notions to create a unique line of products and services. Subsequently, I met company executive Rob Frasca in another context and got the chance to tell him how Affinova was the company that broke my entrepreneurial heart. For those interested in a nice case study of marrying GAs and business, read the Affinova white paper here.
Tuesday, January 25, 2005
Growth in market for GAs
With IDC predicting compounded growth of 8% per year in the predictive analytics market, perhaps now is the time for some of us to become more entrepreneurial like GA/EC-based companies such as Engineous Software, Machine Insight, NuTech Solutions, and Schema.
IDC defines predictive analytics software to “include all analytics, both tools and packaged applications, more complex in their mathematics than core analytics.”
“Predictive analytics are used to determine the probable future outcome of an event or the likelihood of a current state where it is unknown,” the study continues. “These analytics include data mining, clustering, decision trees, market basket analysis, regression modeling, neural nets, genetic algorithms, text mining, hypothesis testing, decision analytics, and so on.”
Monday, January 24, 2005
Welcome
Starting today, past and present IlliGAL lab members and affiliates will team up to blog on genetic algorithms, evolutionary computation, and related subjects. In the last American election cycle, blogging grew into a force to rival the mainstream media. In less widely followed topics, blogging is growing at astonishing rates, and we hope that IlliGAL Blogging will become a resource and focal point for intelligent discourse in the growing community of genetic and evolutionary computation (GEC).
So let the blogging begin. As with other blogs, reader commentary strongly shapes the agenda, so join us with your comments, or join us with your own GEC-related blog, but join us.