Thursday, December 30, 2004

Personalization Papers & Links

Data Mining for Web Intelligence (PDF)
Integrating Web Usage and Content Mining for More Effective Personalization (PDF)
Web Data Mining (DePaul University)
Automatic Personalization Based on Web Usage Mining
Semantic Web Personalization
Links & White Papers

Tuesday, December 07, 2004

Website Log Analyzers

AWStats | Comparison to the below Log analyzers

Web Log Analysis Tools
There are several free log analysis tools. AWStats seems to be the most popular open source tool; however, I'm not sure it is as nice as some of the commercial products, such as Omniture's SiteCatalyst.

Monday, December 06, 2004

Information Retrieval

This is a good reference book with commonly used information retrieval methods.

Wednesday, December 01, 2004

Personalized News and Blog Site

Here is a new personalization website,
which is done quite well. It is called Findory. It is smart because no login
is required, yet personalization occurs from the moment you visit
the site and begin clicking.

with Greg
is a blog that discusses personalization and
customization in web search and online news.

Tuesday, November 23, 2004

Data Mining Interest Continues To Rise

leads data mine plan ( - Australia)

"The University of Technology Sydney is trying to establish a $38
million data mining center of excellence to involve universities,
industry and government..."

It is interesting to see how important Data Mining is becoming.

Thursday, November 11, 2004

Data Mining Dataset & Model Repositories

UCI KDD Archive
The central repository for data mining datasets.

ML UCI Repository
The central machine learning (ML) repository.

PMML Sample
Various PMML models for some of the commonly used datasets
(such as Iris, Voting, and Elnino).


PMML Specs
"Predictive Model Markup Language (PMML) is an XML-based language
which provides a quick and easy way for companies to define predictive
models and share models between compliant vendors' applications."

It is sponsored by the Data Mining Group (DMG).

Thursday, October 28, 2004

Data Mining in the news...

Uncle Sam is Watching You
This article talks about various ways in which the government uses data mining.

research ($600,000 funded by Carnegie Mellon) that will be
used to create software "for discovering, visualizing and
exploring significant patterns across large collections of full-text
humanities resources in digital libraries and collections." The
project is titled: "Web-based Text-Mining and Visualization for
Humanities Digital Libraries."

Oracle(R) Data Mining Recognized as a Leader by Independent ...

Latest Version of SPSS Data Mining Workbench Enhances Integration ...

Monday, October 25, 2004

Personalized Web

Philip Chan
The main goal of Philip's research is to "develop techniques for
adaptive and personalized information searching and navigation
guidance on the web."

Related Work

Implicit User Profiling for On Demand Relevance

International Conference on Intelligent User Interfaces

Friday, October 15, 2004

Web Mining (Relationships)

An Exploration of Entity Models, Collective Classification
and Relation Description (PDF)
Interesting research which suggests possibilities for creating
question-answering systems.

Great paper about extracting social networks and contact information from only a user's email inbox.

Tuesday, October 12, 2004

Tuesday, October 05, 2004

ID3 - Machine Learning

I've been studying and coding up the ID3 classification algorithm.
When I get a free moment I'll be reading Text and Web Mining papers.
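
For reference, the heart of ID3 is choosing, at each node, the attribute with the highest information gain. The following is only a minimal sketch of that step, not the code I have been writing: the function names and the tiny weather-style table are made up for illustration, and it omits the recursion that builds the full tree.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target):
    """Entropy reduction from splitting rows (a list of dicts) on attribute."""
    base = entropy([row[target] for row in rows])
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return base - remainder


def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain, as ID3 does per node."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Made-up toy data: "outlook" fully determines "play", "windy" tells us nothing.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
print(best_attribute(rows, ["outlook", "windy"], "play"))
```

A full ID3 would call best_attribute recursively on each partition until the labels in a partition are all the same or no attributes remain.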

Wednesday, September 22, 2004

Papers that I'm reading

Low-Complexity Fuzzy Relational Clustering Algorithms for Web Mining (PDF)

Introduction to Data Mining and Knowledge Discovery (PDF)

Saturday, September 18, 2004

Meta Learning (METAL)

These past couple of days I have been browsing the Internet and reading more about Data Mining while focusing on Meta Learning. I have posted links to some of the documents that I found interesting. Unfortunately the main MetaL-KDD website is down, so I cannot read what is available there. Instead, I've been googling and reading what else is available today.
Discusses basic concepts about Data Mining and Meta Learning
List of Data Mining and Knowledge Discovery (KD) Websites
The gateway to statistics from over 100 U.S. Federal agencies

Weka Metal (Meta Learning Extension for Weka)

References for Christophe Giraud-Carrier

Data Mining Resources (somewhat outdated)
UCL Data Mining
Protein Structure Analysis and Modeling (not sure what this is)

Web site navigation...
PDF version of the citation above

Wednesday, September 15, 2004

Data Mining

Talked with Dr. Giraud-Carrier about the LDSM data and some other interesting projects associated with the following links:

I also read the following article which gives a nice overview of Data Mining:


Monday, September 13, 2004

More Testing

Without the switches (-mx and -oss), I was able to use 91,080 KB before running out of memory (java.lang.OutOfMemoryError). I ran Weka using the following:
C:\Program Files\Weka-3-4>java -jar weka.jar

With the switches (-mx and -oss), I was able to use 123,804 KB before again running out of memory (java.lang.OutOfMemoryError). I ran Weka using the following:
C:\Program Files\Weka-3-4>java -mx100000000 -oss100000000 -jar weka.jar

Even though I was able to use up to 123,804 KB (~ 121 MB) before running out of memory, it isn't sufficient to produce the results that I would like (By the way, I've been running on a machine with 1GB of RAM).

I have attempted various methods in order to stay within memory limits. As I'm somewhat unsure of what I'm mining for, I have selected the attributes that merely seem most interesting to me. For instance, I removed every column except for the mission and state. I then ran J48 and it succeeded! The tree visualization, however, wasn't very impressive since it only had these two attributes.
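
One way to drop columns before the data ever reaches Weka is to filter an exported CSV file. This is only a sketch under assumptions: it presumes the LDSM data has been exported to CSV, and apart from "mission" and "state" the column names and all of the values below are invented for the example.

```python
import csv
import io


def keep_columns(source, destination, wanted):
    """Copy CSV rows from source to destination, keeping only the wanted columns."""
    reader = csv.DictReader(source)
    # extrasaction="ignore" silently drops every column not listed in wanted.
    writer = csv.DictWriter(
        destination, fieldnames=wanted, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Invented sample standing in for the real export.
sample = io.StringIO("id,mission,state,age\n1,Alpha,UT,20\n2,Beta,ID,21\n")
reduced = io.StringIO()
keep_columns(sample, reduced, ["mission", "state"])
print(reduced.getvalue())
```

With real files, the two StringIO objects would be replaced by files opened with newline="" as the csv module recommends.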

It has been discouraging to be constantly running out of memory.

I clustered the complete dataset and found nothing very interesting. The KMeans clustering algorithm didn't require much memory.

I created a complete E-R Diagram of the LDSM database. I'd like to meet with Dr. Giraud-Carrier, an experienced data-miner, and talk about the diagram and determine what more (if anything) would be interesting to run on the data. I'd like to eliminate useless attribute columns so that I can achieve some interesting results before exhausting the memory.

Saturday, September 11, 2004

Running Weka

I think this may be the best way to run Weka with more memory. I still ran out of memory, but I believe it allocated more. This is what I did:

> java -mx100000000 -oss100000000 -jar weka.jar

I'm still trying to learn the best approach to this. I think I was able to use more memory as it ran, but it didn't allocate 100000000K of memory. The value given to -mx is read as bytes unless it carries a k or m suffix, so the request was really for 100000000 bytes.
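
To put that number in perspective, here is the unit conversion worked out, assuming -mx100000000 is read as a plain byte count (as the option description quoted in the September 10 post says). The variable names are just for illustration.

```python
requested_bytes = 100_000_000  # the number passed to -mx, read as bytes

requested_kb = requested_bytes / 1024
requested_mb = requested_kb / 1024

print(f"{requested_kb:,.0f} KB")  # 97,656 KB
print(f"{requested_mb:.1f} MB")   # 95.4 MB
```

So the request was for roughly 95 MB of heap, which is far less than 100000000K.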

Friday, September 10, 2004

Data Mining

I have set up Weka and begun working with my LDSM data set. The LDSM dataset is about 15,000 records and has a number of attributes. I want to analyze it with Weka and see if I can find anything interesting. I have set up Weka on two computers, but Java has run out of memory on both. I believe it is due to the amount of memory that has been allocated to Java, not how much memory each of the computers has.
Attempts at running the suggested command (java -mx100000000 -oss100000000) have been unsuccessful thus far.

According to the website, the option is used as follows:
Specify the maximum size, in bytes, of the memory allocation pool. This value must be a multiple of 1024 greater than 2MB. Append the letter k or K to indicate kilobytes, or m or M to indicate megabytes. The default value is 64MB.

I may simply need to make the number a multiple of 1024. I'll try that next...nope it didn't work.
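
The rules in that description can be written down as a small checker. This is a sketch of the quoted text only, not of how the JVM itself parses the option; parse_heap_size and the constants are hypothetical names of my own.

```python
import re

MIN_HEAP_BYTES = 2 * 1024 * 1024  # "greater than 2MB" in the quoted description
UNIT_FACTORS = {"": 1, "k": 1024, "m": 1024 * 1024}


def parse_heap_size(value):
    """Return the size in bytes for a spec such as '83886080', '81920k' or '80m'.

    Raises ValueError when the spec breaks one of the quoted rules.
    """
    match = re.fullmatch(r"(\d+)([kKmM]?)", value)
    if match is None:
        raise ValueError(f"not a number with an optional k/m suffix: {value!r}")
    size = int(match.group(1)) * UNIT_FACTORS[match.group(2).lower()]
    if size % 1024 != 0:
        raise ValueError(f"{size} bytes is not a multiple of 1024")
    if size <= MIN_HEAP_BYTES:
        raise ValueError(f"{size} bytes is not greater than 2MB")
    return size


for spec in ("80m", "83886080", "100000000"):
    try:
        print(spec, "->", parse_heap_size(spec), "bytes")
    except ValueError as error:
        print(spec, "-> rejected:", error)
```

By these rules 80m and 83886080 are the same size, while 100000000 is not a multiple of 1024.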

I tried the following:
C:\>java -Xmx 80m
Invalid maximum heap size: -Xmx
Could not create the Java virtual machine.

C:\>java -Xmx83886080
Usage: java [-options] class [args...]
(to execute a class)
or java [-options] -jar jarfile [args...]
(to execute a jar file)

where options include:
-client to select the "client" VM
-server to select the "server" VM