Monday, February 14, 2005
The abstract of my PhD thesis
Hello,
I am posting the abstract of my recent PhD thesis, which was greatly improved by my visit to IlliGAL last spring.
Title: Pittsburgh genetics-based machine learning in the data mining era: representations, generalization, and run-time
Abstract:
Pittsburgh genetics-based machine learning (DeJong, Spears, & Gordon, 1993) is, among others (Wilson, 1995; Venturini, 1993), an application of evolutionary computation techniques (Holland, 1975; Goldberg, 1989a) to machine learning tasks. The systems belonging to this approach are characterized by evolving individuals that are complete, usually variable-length, rule sets. Therefore, the solution proposed by this kind of system is the best individual of the population.
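To make the idea concrete, here is a minimal Python sketch of such an individual: a variable-length, ordered rule set evaluated as a decision list, with the best individual of the population taken as the solution. All names here are my own illustrative choices, not GAssist's actual API.

    # Minimal sketch of a Pittsburgh-style individual (illustrative
    # names, not GAssist's actual API): a variable-length, ordered
    # rule set evaluated as a decision list.
    from dataclasses import dataclass

    @dataclass
    class Rule:
        conditions: dict        # attribute index -> set of allowed values
        predicted_class: int

        def matches(self, example):
            # A rule fires when every condition accepts the example's value.
            return all(example[a] in allowed
                       for a, allowed in self.conditions.items())

    @dataclass
    class Individual:
        rules: list             # variable-length, ordered list of Rule

        def classify(self, example, default_class=0):
            # Decision-list semantics: the first matching rule predicts;
            # otherwise fall through to a default class.
            for rule in self.rules:
                if rule.matches(example):
                    return rule.predicted_class
            return default_class

    def fitness(individual, examples, labels):
        # Fitness of the whole rule set, e.g. its training accuracy.
        hits = sum(individual.classify(x) == y
                   for x, y in zip(examples, labels))
        return hits / len(examples)

    # The solution the system returns is the best individual of the
    # population, e.g.:
    #   best = max(population, key=lambda ind: fitness(ind, X, y))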
When using this approach, we have to deal with several problematic issues: controlling the size of the individuals in the population, applying the correct degree of generalization pressure across a broad range of datasets, reducing the considerable run-time of the system, and handling datasets with diverse kinds of attributes. All these issues become even more critical in modern-day data mining problems.
The general objective of this thesis is to adapt the Pittsburgh model to handle such datasets successfully. This objective is split into three parts: (1) improving the generalization capacity of the model, (2) reducing the run-time of the system, and (3) proposing representations for real-valued attributes. These three objectives have been achieved by a combination of four types of proposals:
- Explicit and static default rules
- Windowing techniques for generalization and run-time reduction
- Bloat control and explicit generalization pressure techniques
- The Adaptive Discretization Intervals rule representation for real-valued attributes
Some of these proposals focus on a single objective, while others partially address more than one objective at the same time. All of them are integrated into a system called GAssist (Genetic clASSIfier sySTem).
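As a concrete illustration of two of these mechanisms, here is a hedged sketch, building on the Individual class above, of (a) a windowing scheme that evaluates fitness on alternating strata of the training set, and (b) an explicit size penalty as a simple form of bloat control. The function names, the stratification scheme, and the penalty weight are my own illustrative assumptions, not the exact formulations used in the thesis.

    # Hedged sketch of windowing and bloat control (my own illustrative
    # formulation, not the thesis's exact method), reusing Individual
    # from the sketch above.
    import random

    def make_strata(examples, labels, num_strata, seed=0):
        # Split the training set into disjoint strata of roughly
        # equal size.
        data = list(zip(examples, labels))
        random.Random(seed).shuffle(data)
        return [data[i::num_strata] for i in range(num_strata)]

    def windowed_fitness(individual, strata, generation):
        # Each generation evaluates on a single stratum, cycling
        # through them. This cuts evaluation cost per generation and,
        # because the "view" of the data keeps changing, discourages
        # overfitting to any fixed subset.
        stratum = strata[generation % len(strata)]
        hits = sum(individual.classify(x) == y for x, y in stratum)
        return hits / len(stratum)

    def penalized_fitness(individual, strata, generation, alpha=0.01):
        # Explicit generalization pressure / bloat control: subtract a
        # penalty proportional to rule-set length so compact rule sets
        # are preferred (alpha is a made-up weight; the thesis explores
        # more refined schemes).
        return (windowed_fitness(individual, strata, generation)
                - alpha * len(individual.rules))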
An experimental evaluation covering a wide range of data mining problems, chosen according to many different criteria, has been performed. The experiments reported in the thesis are split into two parts. The first part studies several alternatives, integrated in the GAssist framework, for each kind of proposal. The analysis of these results leads us to propose a small number of global configurations of the system, which are compared in the second part of the experimentation to a wide range of learning systems, showing that the system achieves competitive performance and generates very compact and interpretable solutions.
As one of the topics of my research is the use of default rules, I am very interested in Rob Smith's work on this topic.