Social Capital in Networks: 2004

Thursday, December 30, 2004

Personalization Papers & Links

Data Mining for Web Intelligence (PDF)
Integrating Web Usage and Content Mining for More Effective Personalization (PDF)
Web Data Mining (DePaul University)
Automatic Personalization Based on Web Usage Mining
Semantic Web Personalization
Links & White Papers

Tuesday, December 07, 2004

Website Log Analyzers

AWStats | Comparison to the below Log analyzers
Analog
Webalizer

Web Log Analysis Tools
There are several free log analysis tools. AWStats seems to be the most popular open source tool; however, I'm not sure it is as nice as some of the commercial products, such as Omniture's SiteCatalyst.
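
To give a feel for what these analyzers do at their core, here is a minimal sketch in Java that counts hits per page. It assumes the log is in Apache Common Log Format; the class name LogHits and the sample lines are made up for illustration, and real tools such as AWStats handle many more formats and edge cases.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: count hits per requested path in Common Log Format lines. */
class LogHits {

    // host ident user [timestamp] "METHOD path protocol" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) \\S+\" (\\d{3}) (\\d+|-)$");

    /** Returns a sorted map from requested path to number of matching log lines. */
    static Map<String, Integer> hitsPerPath(String[] lines) {
        Map<String, Integer> hits = new TreeMap<>();
        for (String line : lines) {
            Matcher m = CLF.matcher(line);
            if (m.matches()) {
                // Group 4 is the requested path; lines that do not match are skipped.
                hits.merge(m.group(4), 1, Integer::sum);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not taken from a real server.
        String[] sample = {
            "127.0.0.1 - - [07/Dec/2004:10:00:00 -0700] \"GET /index.html HTTP/1.0\" 200 2326",
            "127.0.0.1 - - [07/Dec/2004:10:00:05 -0700] \"GET /about.html HTTP/1.0\" 200 512",
            "127.0.0.1 - - [07/Dec/2004:10:00:09 -0700] \"GET /index.html HTTP/1.0\" 304 -"
        };
        System.out.println(hitsPerPath(sample));
    }
}
```

A full analyzer would also break the counts down by visitor, referrer, and time of day; this only shows the parsing step.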

Monday, December 06, 2004

Information Retrieval

This is a good reference book with commonly used information retrieval methods.

Wednesday, December 01, 2004

Personalized News and Blog Site

Here is a new personalization website that is done quite well. It is
called Findory (http://findory.com/). It is smart because no login
is required, yet personalization occurs from the moment you visit
the site and begin clicking.

Geeking with Greg is a blog that discusses personalization and
customization in web search and online news.

Tuesday, November 23, 2004

Data Mining Interest Continues To Rise

University leads data mine plan (NEWS.com.au - Australia)

"The University of Technology Sydney is trying to establish a $38
million data mining center of excellence to involve universities,
industry and government..."

It is interesting to see how important Data Mining is becoming.

Friday, November 19, 2004

Bayes

Bayes Lecture

Papers

Repository of publications related to Data Mining

Includes papers such as "Efficient algorithms for creating product
catalogs," "Selective Markov Models for Predicting Web-Page Accesses,"
and "Web Page Categorization and Feature Selection Using Association
Rule and Principal Component Clustering," plus some others that look
interesting.

Thursday, November 11, 2004

Data Mining Dataset & Model Repositories

UCI KDD Archive
The central repository for data mining datasets.

ML UCI Repository
The central machine learning (ML) repository.

PMML Sample Models
Various PMML models for some of the commonly used datasets
(such as Iris, Voting, and Elnino).

DMG - PMML

PMML Specs
"Predictive Model Markup Language (PMML) is an XML-based language
which provides a quick and easy way for companies to define predictive
models and share models between compliant vendors' applications."

It is sponsored by the Data Mining Group (DMG).

Thursday, October 28, 2004

Data Mining in the news...

Uncle Sam is Watching You
This article talks about various ways in which the government uses data mining.

Data-Mining research ($600,000 funded by Carnegie Mellon) that will be
used to create software "for discovering, visualizing and
exploring significant patterns across large collections of full-text
humanities resources in digital libraries and collections." The
project is titled "Web-based Text-Mining and Visualization for
Humanities Digital Libraries."

Oracle(R) Data Mining Recognized as a Leader by Independent ...

Latest Version of SPSS Data Mining Workbench Enhances Integration ...

Monday, October 25, 2004

Personalized Web

Philip Chan
The main goal of Philip's research is to "develop techniques for
adaptive and personalized information searching and navigation
guidance on the web."

Related Work

Implicit User Profiling for On Demand Relevance

International Conference on Intelligent User Interfaces
http://iuiconf.org/overview.html

Friday, October 15, 2004

Web Mining (Relationships)

An Exploration of Entity Models, Collective Classification
and Relation Description (PDF)
Interesting research that suggests possibilities for creating
question-answering systems.

http://www.ceas.cc/papers-2004/176.pdf
Great paper about extracting social networks and contact information from only a user's email inbox.

Wednesday, October 13, 2004

Text & Web Mining

http://filebox.vt.edu/users/wfan/text_mining.html
http://poorva.com/aie/?Google_WebMining
http://citeseer.ist.psu.edu/cooley97web.html
http://www-users.cs.umn.edu/~mobasher/webminer/survey/survey.html

Data Mining Glossary
http://www.togaware.com/datamining/survivor/Glossary.html (Useful glossary of data mining concepts and vocabulary.)

Andrew McCallum

This guy has some interesting research:
http://www.cs.umass.edu/~mccallum/

Tuesday, October 12, 2004

Bamshad Mobasher

He has done some interesting research regarding web mining and user personalization:
http://maya.cs.depaul.edu/~mobasher/index.html
http://maya.cs.depaul.edu/~mobasher/cgi-bin/view-pubs.pl?CID=WCM

Tuesday, October 05, 2004

ID3 - Machine Learning

I've been studying and coding up the ID3 classification algorithm.
When I get a free moment I'll be reading Text and Web Mining papers.
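
The heart of ID3 is choosing, at each node, the attribute with the highest information gain. Below is a minimal sketch of just that calculation in Java; it is not my full implementation. The class name Id3Sketch and the toy weather-style data are made up for illustration, and it assumes nominal attributes with no missing values.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the attribute-selection step used by ID3. */
class Id3Sketch {

    /** Shannon entropy, in bits, of an array of class labels. */
    static double entropy(String[] labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double h = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / labels.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /**
     * Information gain from splitting the rows on one attribute.
     * attributeValues[i] and labels[i] describe the same row.
     */
    static double informationGain(String[] attributeValues, String[] labels) {
        // Count how many rows carry each attribute value.
        Map<String, Integer> sizes = new HashMap<>();
        for (String value : attributeValues) {
            sizes.merge(value, 1, Integer::sum);
        }
        // Weighted average entropy of the subsets produced by the split.
        double remainder = 0.0;
        for (Map.Entry<String, Integer> entry : sizes.entrySet()) {
            String[] subset = new String[entry.getValue()];
            int k = 0;
            for (int i = 0; i < labels.length; i++) {
                if (attributeValues[i].equals(entry.getKey())) {
                    subset[k++] = labels[i];
                }
            }
            remainder += ((double) subset.length / labels.length) * entropy(subset);
        }
        return entropy(labels) - remainder;
    }

    public static void main(String[] args) {
        // Made-up toy data: "outlook" separates the labels perfectly, "windy" does not.
        String[] outlook = {"sunny", "sunny", "rain", "rain"};
        String[] windy   = {"yes", "no", "yes", "no"};
        String[] play    = {"no", "no", "yes", "yes"};
        System.out.println("gain(outlook) = " + informationGain(outlook, play));
        System.out.println("gain(windy)   = " + informationGain(windy, play));
    }
}
```

ID3 would split on the attribute with the larger gain and then repeat the same calculation on each subset until the labels in a subset all agree or no attributes remain.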

Wednesday, September 22, 2004

Papers that I'm reading

Low-Complexity Fuzzy Relational Clustering Algorithms for Web Mining (PDF)

Introduction to Data Mining and Knowledge Discovery (PDF)

Monday, September 20, 2004

Web Mining Links

Web Mining
http://www.cs.umbc.edu/~ajoshi/web-mine/
http://www.cs.utexas.edu/users/pebronia/text-mining/
http://www-2.cs.cmu.edu/afs/cs/project/theo-3/www/
http://www.cs.ualberta.ca/~tszhu/webmining.htm
http://filebox.vt.edu/users/wfan/text_mining.html
http://www.kdnuggets.com/
http://www.kddresearch.org/
CMU World Wide Knowledge Base (Web->KB) project

Web Mining Software
http://www.kdnuggets.com/software/web.html
http://www-ai.cs.uni-dortmund.de/SOFTWARE/YALE/download.html

Saturday, September 18, 2004

Meta Learning (METAL)

These past couple of days I have been browsing the Internet and reading more about Data Mining, focusing on Meta Learning. I have posted links to some of the documents that I found interesting. Unfortunately the main MetaL-KDD website (http://www.metal-kdd.org) is down, so I cannot read what is available there. Instead I've been googling and reading what else is available today.

http://www.statsoft.com/textbook/stdatmin.html#meta
Discusses basic concepts about Data Mining and Meta Learning

http://www.kdnuggets.com/websites/data-mining.html
List of Data Mining and Knowledge Discovery (KD) Websites

http://www.fedstats.gov/
The gateway to statistics from over 100 U.S. Federal agencies

Weka Metal (Meta Learning Extension for Weka)

http://www.cs.bris.ac.uk/Publications/pub_by_author.jsp?id=12799
References for Christophe Giraud-Carrier

http://www.scd.ucar.edu/hps/GROUPS/dm/dm.html
Data Mining Resources (somewhat outdated)

UCL Data Mining
Protein Structure Analysis and Modeling (not sure what this is)

Web site navigation...
http://www.dcs.bbk.ac.uk/~mark/download/besttrail.pdf
http://citeseer.ist.psu.edu/levene03navigating.html
PDF version of the citation above

Wednesday, September 15, 2004

Data Mining

Talked with Dr. Giraud-Carrier about the LDSM data and some other interesting projects associated with the following links:
http://www.metal-kdd.org
http://www.ai.univie.ac.at/oefai/ml/metal/metal-bib.html

I also read the following article which gives a nice overview of Data Mining:

Article

Tuesday, September 14, 2004

DB Schema

DB Schema (text file)

Monday, September 13, 2004

More Testing

Without the -mx and -oss switches, I was able to use 91,080 KB before running out of memory (java.lang.OutOfMemoryError). I ran Weka using the following:
C:\Program Files\Weka-3-4>java -jar weka.jar

With the -mx and -oss switches, I was able to use 123,804 KB before again running out of memory (java.lang.OutOfMemoryError). I ran Weka using the following:
C:\Program Files\Weka-3-4>java -mx100000000 -oss100000000 -jar weka.jar

Even though I was able to use up to 123,804 KB (~ 121 MB) before running out of memory, that isn't sufficient to produce the results that I would like. (By the way, I've been running on a machine with 1 GB of RAM.)

I have attempted various methods in order to stay within memory limits. As I'm somewhat unsure of what I'm mining for, I have selected the attributes that merely seem most interesting to me. For instance, I removed every column except for the mission and state. I then ran J48 and it succeeded! The tree visualization, however, wasn't very impressive since it only had these two attributes.

It has been discouraging to be constantly running out of memory.

I clustered the complete dataset and found nothing very interesting. The KMeans clustering algorithm didn't require much memory.

I created a complete E-R Diagram of the LDSM database. I'd like to meet with Dr. Giraud-Carrier, an experienced data-miner, and talk about the diagram and determine what more (if anything) would be interesting to run on the data. I'd like to eliminate useless attribute columns so that I can achieve some interesting results before exhausting the memory.

Saturday, September 11, 2004

Running Weka

I think this may be the best way to run Weka with more memory. Even though I still ran out of memory, I believe it allocated more. This is what I did:

> java -mx100000000 -oss100000000 -jar weka.jar

I'm still trying to learn the best approach to this. I think I was able to use more memory as it ran, but it didn't allocate the amount I thought I had requested. (A value with no suffix is read as bytes, so 100000000 is roughly 95 MB, not 100000000K.)
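
One way to see how much heap the JVM actually received, rather than guessing from the process size, is to ask the runtime directly. Here is a minimal sketch; the class name HeapReport is made up for illustration, and it only uses the standard Runtime methods maxMemory, totalMemory, and freeMemory.

```java
/** Hypothetical helper: prints the heap limits the running JVM actually received. */
class HeapReport {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory: the most the JVM will try to use for the heap.
        System.out.println("max heap (MB):   " + toMegabytes(rt.maxMemory()));
        // totalMemory: heap currently reserved; freeMemory: unused part of that.
        System.out.println("total heap (MB): " + toMegabytes(rt.totalMemory()));
        System.out.println("free heap (MB):  " + toMegabytes(rt.freeMemory()));
    }
}
```

After compiling, running it once as "java HeapReport" and once as "java -Xmx80m HeapReport" should show whether the switch took effect. The reported maximum can come out slightly below the requested value, so treat the numbers as approximate.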

Friday, September 10, 2004

Data Mining

I have set up Weka and begun working with my LDSM data set. The LDSM dataset is about 15000 records and has a number of attributes. I want to analyze it with Weka and see if I can find anything interesting. I have set up Weka on two computers, but Java has run out of memory on both. I believe it is due to the amount of memory that has been allocated to Java, not how much memory each of the computers has.

http://www.cs.waikato.ac.nz/~ml/weka/tips_and_tricks.html
Attempts at running the suggested command (java -mx100000000 -oss100000000) have been unsuccessful thus far.

According to the java.sun.com website the command is used as follows:

-Xmxn

Specify the maximum size, in bytes, of the memory allocation pool. This value must be a multiple of 1024 greater than 2MB. Append the letter k or K to indicate kilobytes, or m or M to indicate megabytes. The default value is 64MB.

     -Xmx83886080

     -Xmx81920k

     -Xmx80m
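
The three forms above all ask for the same amount of memory. Here is a small sketch that converts each one to bytes to confirm it; the class name HeapSize and the method toBytes are made up for illustration, and it only handles the k and m suffixes described in the quoted text.

```java
/** Hypothetical helper: converts -Xmx style size strings to a byte count. */
class HeapSize {

    /** Parses sizes such as "83886080", "81920k", or "80m" (suffix is case-insensitive). */
    static long toBytes(String size) {
        char last = Character.toLowerCase(size.charAt(size.length() - 1));
        String digits = size.substring(0, size.length() - 1);
        if (last == 'k') {
            return Long.parseLong(digits) * 1024L;
        }
        if (last == 'm') {
            return Long.parseLong(digits) * 1024L * 1024L;
        }
        // No suffix: the value is already in bytes.
        return Long.parseLong(size);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"83886080", "81920k", "80m"}) {
            System.out.println(s + " = " + toBytes(s) + " bytes");
        }
        // The value from the Weka tips page, checked against the multiple-of-1024 rule.
        System.out.println("100000000 % 1024 = " + (toBytes("100000000") % 1024L));
    }
}
```

All three examples come to 83886080 bytes, while 100000000 leaves a remainder of 256 when divided by 1024, so it is not a multiple of 1024.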

I may simply need to make the number a multiple of 1024. I'll try that next...nope it didn't work.

I tried the following:
C:\>java -Xmx 80m
Invalid maximum heap size: -Xmx
Could not create the Java virtual machine.

C:\>java -Xmx83886080
Usage: java [-options] class [args...]
(to execute a class)
or java [-options] -jar jarfile [args...]
(to execute a jar file)

where options include:
-client to select the "client" VM
-server to select the "server" VM
... {MERELY PRINTED OUT USAGE INFORMATION}