Barker et al. present the GeneNet algorithm, which is designed to learn genetic regulatory network connectivity from time series data. GeneNet is similar to the work of Yu et al. (2004), but it takes a new approach: it computes ratios of conditional probabilities and accumulates votes to determine influence between species. Yu et al. use Dynamic Bayesian Networks (DBNs) and a cumulative distribution function (cdf) to compute a score for each species that may influence a gene, whereas GeneNet approaches the problem by searching for differences between time points.
The pseudocode of the GeneNet algorithm is as follows:
GeneNet(Species S, Expts E, Influences I, Thresholds T, Levels L)
  L := DetermineLevels(S, E, L)
  foreach c in S:
    Y := CreateInfluenceVectorSet(c, S, E, I, T, L)
    Y := CombineInfluenceVectors(c, S, E, I, T, L, Y)
    I(c) := CompeteInfluenceVectors(c, S, E, T, L, Y)
  return I
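The control flow above can be sketched in Python. This is only a structural skeleton: the helper functions below are stubs I've added for illustration, and the actual scoring logic (the conditional-probability ratios and vote accumulation) lives in the paper, not here.

```python
# Skeleton of the GeneNet control flow from the pseudocode above.
# All helpers are stubs; the real algorithm's logic is in the paper.

def determine_levels(species, expts, levels):
    # Stub: discretize each species' expression values into levels.
    return levels

def create_influence_vector_set(c, species, expts, influences, thresholds, levels):
    # Stub: build candidate influence vectors for species c.
    return []

def combine_influence_vectors(c, species, expts, influences, thresholds, levels, y):
    # Stub: merge compatible influence vectors.
    return y

def compete_influence_vectors(c, species, expts, thresholds, levels, y):
    # Stub: let surviving vectors compete; return the winners for c.
    return y

def gene_net(species, expts, influences, thresholds, levels):
    levels = determine_levels(species, expts, levels)
    for c in species:
        y = create_influence_vector_set(c, species, expts, influences, thresholds, levels)
        y = combine_influence_vectors(c, species, expts, influences, thresholds, levels, y)
        influences[c] = compete_influence_vectors(c, species, expts, thresholds, levels, y)
    return influences
```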
Because little real time series data was available, synthetic data sets were generated for comparison. Empirical studies pitted GeneNet against Yu's DBN algorithm on these synthetic data sets; GeneNet had significantly better precision, recall, and runtime in the majority of experiments.
(See the paper in Transactions on Computational Biology and Bioinformatics, No. 8, March 2007.)
Monday, October 01, 2007
Thursday, September 20, 2007
Metabolite
Quoted from the Columbia Encyclopedia:
metabolite, organic compound that is a starting material in, an intermediate in, or an end product of metabolism. Starting materials are substances, usually small and of simple structure, absorbed by the organism as food. These include the vitamins and essential amino acids. They can be used to construct more complex molecules, or they can be broken down into simpler ones. Intermediary metabolites are by far the most common; they may be synthesized from other metabolites, perhaps used to make more complex substances, or broken down into simpler compounds, often with the release of chemical energy. For example, glucose, perhaps the single most important metabolite, can be synthesized in a process called gluconeogenesis, can be polymerized to form starch or glycogen, and can be broken down during glycolysis in order to obtain chemical energy. End products of metabolism are the final result of the breakdown of other metabolites and are excreted from the organism without further change; they usually cannot be used to synthesize other metabolites.
Thursday, September 13, 2007
Bioinformatics Reading Thoughts
I'm taking a Bioinformatics class and have been reading "System Modeling in Cellular Biology" by Szallasi et al. Here are the thoughts that I have had while reading the first few chapters.
Data-driven versus hypothesis-driven research
The world is very complex, and science is how we come to understand how things work. Science has often been hypothesis-driven: a question based on what we, or perhaps a few researchers, have observed, followed by experimentation that increases our understanding of the problem. Recently, we have been gathering more and more data that can also drive research; these questions are based not only on what we might observe in life, but additionally on what the data suggests, in some instances from millions of people. Both ways of attacking the problem can lead us to the same truth; however, it seems that the latter has more potential to get us there quicker.
Modeling
Modeling is used constantly in biological research. Szallasi mentions a couple of reasons why models are useful: (1) they let us test whether our understanding of a system is accurate and reflects known facts, and (2) they can help us understand which parts of the system contribute most to some desired properties of interest.
Robustness
I love how robust and resilient biological processes are. I would love to be able to create a computer program that is even a fraction as robust as, say, the body is at healing itself.
Modularity
Many biological processes are modular, much as a good programmer factors code into functions or classes. For example, a human kidney can be transplanted into another person and work successfully in them. Likewise, in programming, code that connects to a database can be reused within multiple programs.
Bottom-up versus Top-down approaches
Bottom-up approaches typically build on existing biological knowledge, whereas top-down approaches leverage the enormous amount of biological data to find something important to then delve into.
Thursday, May 03, 2007
Export data from Postgres
Exporting data from Postgres to an output file of your choice can be done by following these simple steps:
1. Start psql with the database that you'd like to export from...
$ psql [DATABASE]
2. Toggle the output mode to unaligned (\a toggles between unaligned and aligned output mode)
=# \a
3. Turn "tuples only" on, so that only the rows are printed, without column headers or the row-count footer (\t toggles tuples-only mode on and off)
=# \t
4. Set the output file (replace [FILE] with what you'd like to call your output file). All subsequent query results will be sent to the file or |pipe.
=# \o [FILE]
5. Run whatever query you'd like to send to the output file. For example,
=# SELECT * FROM [TABLE];
In summary:
\a
\t
\o /tmp/outputfile.txt
SELECT ......
\o
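As an alternative, if you only need the results of a single query, the \copy meta-command can write them out in one step without toggling output modes (the query-in-parentheses form requires PostgreSQL 8.2 or later; the table and path here are placeholders):

```
=# \copy (SELECT * FROM [TABLE]) TO '/tmp/outputfile.csv' CSV HEADER
```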
Monday, August 07, 2006
Personalized Marketing
Personalized marketing is a four-phase process:
identifying potential customers
determining their needs and their lifetime value to the company
interacting with customers so as to learn about them
customizing products, services, and communications to individual customers
From Wikipedia, “Personalized marketing,” (Cited: Peppers, D. and Rogers, M. 1993)
Wednesday, June 28, 2006
Dimensionality Reduction Notes
Principal Components Analysis (PCA)
How do you choose how many and which eigenvalues/eigenvectors to use?
Kaiser Criterion
This criterion says to retain only factors with eigenvalues greater than 1. In other words, if a factor does not extract at least as much variance as a single original variable, it is discarded. It is named after Kaiser, who proposed it in 1960, and it seems to be used quite frequently.
The Scree Test
This is a graphical test used to decide how many factors to keep. To perform this test, first plot the eigenvalues in decreasing order. Next, Cattell suggests finding the place where the smooth decrease of eigenvalues appears to level off (to the right), similar to geological scree (the loose rock debris at the bottom of a rocky slope).
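The Kaiser criterion is easy to try out numerically. Here is a small illustration on synthetic data (my own example, not from the text): two pairs of highly correlated variables, so we expect two factors to survive the eigenvalue > 1 cutoff.

```python
import numpy as np

# Synthetic data: four variables, but really only two underlying factors,
# since variables 2 and 4 are near-copies of variables 1 and 3.
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
data = np.column_stack([
    x1,
    x1 + 0.1 * rng.normal(size=n),  # nearly a copy of x1
    x3,
    x3 + 0.1 * rng.normal(size=n),  # nearly a copy of x3
])

# Eigenvalues of the correlation matrix, in decreasing order.
corr = np.corrcoef(data, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: keep only factors with eigenvalue > 1.
retained = int(np.sum(eigenvalues > 1))
print(eigenvalues.round(3), "->", retained, "factors retained")
```

Plotting these eigenvalues in decreasing order also gives the scree plot: the sharp drop after the second eigenvalue is the "elbow" Cattell's test looks for.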
Here are some other useful terms and definitions from the dictionary:
Multicollinearity refers to linear inter-correlation among variables. Simply put, if nominally "different" measures actually quantify the same phenomenon to a significant degree -- i.e., wherein the variables are accorded different names and perhaps employ different numeric measurement scales but correlate highly with each other -- they are redundant.
Friday, June 23, 2006
K-Means Clustering
Basic Algorithm
1. Choose k cluster centers at random
2. Assign each point to nearest cluster center
3. Compute the new cluster centers based on the assigned points
4. Repeat until cluster centers converge
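The four steps above can be sketched with NumPy. This is a minimal illustration, not a production implementation; to keep the demo deterministic, the initial centers are passed in explicitly rather than chosen at random (step 1):

```python
import numpy as np

def kmeans(points, centers, max_iter=100):
    """Minimal k-means: assign points to the nearest center, recompute
    centers as cluster means, and repeat until the centers converge."""
    centers = centers.astype(float).copy()
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest cluster center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = np.array([points[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        # Step 4: stop once the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs; seed one initial center in each blob.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
                    rng.normal([10, 10], 0.5, (20, 2))])
centers, labels = kmeans(points, centers=points[[0, 20]])
print(centers.round(2))
```

With random initialization instead, different runs can converge to different local minima, which is exactly the shortcoming noted below.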
Shortcomings
Finds local minima
The random placement of cluster centers affects the outcome
Here is a nice K-Means Demo
Tuesday, January 31, 2006
GEDCOM file format information
The following links discuss how the GEDCOM format is defined:
Cyndi's List
The GEDCOM Standard Release 5.5
GEDCOM: The Next Generation
Friday, January 20, 2006
Networked Data File Types
Here are reference links to common network data file types used in Link Mining and Social Network Analysis:
Pajek Net File
UCINet DL Files and VNA
Friday, December 09, 2005
Machine Learning Topics
Particle Swarm Optimization
wikipedia
Swarm Intelligence
Ant Algorithms
ant colony optimization
Reinforcement Learning
wikipedia
Q-learning
Q-learning definition
Markov decision process
Computational Learning Theory
wikipedia
VC dimension
Principle of maximum entropy
Ensembles, Bagging and Boosting
Boosting
Meta-Learning
METAL KDD
Christophe Giraud-Carrier
HMMs
Hidden Markov model
Saturday, October 29, 2005
Viral Marketing
Dr. Ralph F. Wilson suggests that Viral Marketing comprises the following components:
1. Gives away products or services
2. Provides for effortless transfer to others
3. Scales easily from small to very large
4. Exploits common motivations and behaviors
5. Utilizes existing communication networks
6. Takes advantage of others' resources
The effects of word-of-mouth, or viral, marketing are a motivation for utilizing the social networks that customers belong to.