Barker et al. present the GeneNet algorithm, which is designed to learn genetic regulatory network connectivity from time series data. GeneNet is similar to the work of Yu et al. (2004), but it takes a new approach: it computes ratios of conditional probabilities and accumulates votes to determine influence between species. Yu et al. use Dynamic Bayesian Networks (DBNs) and a cumulative distribution function (cdf) to compute a score for each species that may influence a gene, whereas GeneNet approaches the problem by searching for differences between time points.
The pseudocode of the GeneNet algorithm is as follows:
GeneNet(Species S, Expts E, Influences I, Thresholds T, Levels L)
  L := DetermineLevels(S, E, L)
  foreach c in S:
    Y := CreateInfluenceVectorSet(c, S, E, I, T, L)
    Y := CombineInfluenceVectors(c, S, E, I, T, L, Y)
    I(c) := CompeteInfluenceVectors(c, S, E, T, L, Y)
  return I
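The control flow above can be sketched in Python. This is only a structural skeleton: the helper functions below are stubs I've added for illustration, and the actual scoring logic (the conditional-probability ratios and vote accumulation) lives in the paper, not here.

```python
# Skeleton of the GeneNet control flow from the pseudocode above.
# All helpers are stubs; the real algorithm's logic is in the paper.

def determine_levels(species, expts, levels):
    # Stub: discretize each species' expression values into levels.
    return levels

def create_influence_vector_set(c, species, expts, influences, thresholds, levels):
    # Stub: build candidate influence vectors for species c.
    return []

def combine_influence_vectors(c, species, expts, influences, thresholds, levels, y):
    # Stub: merge compatible influence vectors.
    return y

def compete_influence_vectors(c, species, expts, thresholds, levels, y):
    # Stub: let surviving vectors compete; return the winners for c.
    return y

def gene_net(species, expts, influences, thresholds, levels):
    levels = determine_levels(species, expts, levels)
    for c in species:
        y = create_influence_vector_set(c, species, expts, influences, thresholds, levels)
        y = combine_influence_vectors(c, species, expts, influences, thresholds, levels, y)
        influences[c] = compete_influence_vectors(c, species, expts, thresholds, levels, y)
    return influences
```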
Because little real time series data was available, synthetic data sets were generated for comparison. Empirical studies pitted GeneNet against Yu's DBN algorithm on these synthetic data sets; GeneNet had significantly better precision, recall, and runtime in the majority of experiments.
(See the paper in Transactions on Computational Biology and Bioinformatics, No. 8, March 2007.)
Monday, October 01, 2007
Thursday, September 20, 2007
Metabolite
Quoted from the Columbia Encyclopedia:
metabolite, organic compound that is a starting material in, an intermediate in, or an end product of metabolism. Starting materials are substances, usually small and of simple structure, absorbed by the organism as food. These include the vitamins and essential amino acids. They can be used to construct more complex molecules, or they can be broken down into simpler ones. Intermediary metabolites are by far the most common; they may be synthesized from other metabolites, perhaps used to make more complex substances, or broken down into simpler compounds, often with the release of chemical energy. For example, glucose, perhaps the single most important metabolite, can be synthesized in a process called gluconeogenesis, can be polymerized to form starch or glycogen, and can be broken down during glycolysis in order to obtain chemical energy. End products of metabolism are the final result of the breakdown of other metabolites and are excreted from the organism without further change; they usually cannot be used to synthesize other metabolites.
Thursday, September 13, 2007
Bioinformatics Reading Thoughts
I'm taking a Bioinformatics class and have been reading "System Modeling in Cellular Biology" by Szallasi et al. Here are the thoughts that I have had while reading the first few chapters.
Data-driven versus hypothesis-driven research
The world is very complex, and science is how we come to understand how things work. Science has often been hypothesis-driven: a question based on what we, or perhaps a few researchers, have observed, followed by experimentation that increases our understanding of the problem. Recently, we have been gathering more and more data that can also drive research; these questions are based not only on what we might observe in life, but additionally on what the data suggests, in some instances from millions of people. Both ways of attacking the problem can lead us to the same truth; however, it seems that the latter has more potential to get us there quicker.
Modeling
Modeling is used constantly in biological research. Szallasi mentions a couple of reasons why models are useful: (1) they let us test whether our understanding of a system is accurate and reflects known facts, and (2) they can help us understand which parts of the system contribute most to some desired properties of interest.
Robustness
I love how robust and resilient biological processes are. I would love to be able to create a computer program that is even a fraction as robust as, say, the body is at healing itself.
Modularity
Many biological processes are modular, much as a good programmer factors code into functions or classes. For example, a human kidney can be transplanted into another person and work successfully in them. Likewise, in programming, code that connects to a database can be reused within multiple programs.
Bottom-up versus Top-down approaches
Bottom-up approaches typically build on existing biological knowledge, whereas top-down approaches leverage the enormous amount of biological data to find something important to then delve into.
Thursday, May 03, 2007
Export data from Postgres
Exporting data from Postgres to an output file of your choice can be done by following these simple steps:
1. Start psql with the database that you'd like to export from...
$ psql [DATABASE]
2. Toggle the output mode to unaligned (\a toggles between unaligned and aligned output mode)
=# \a
3. Turn "tuples only" on, so that only the rows are printed, without column headers or the row-count footer (\t toggles tuples-only mode on and off)
=# \t
4. Set the output file (replace [FILE] with what you'd like to call your output file). All subsequent query results will be sent to the file or |pipe.
=# \o [FILE]
5. Run whatever query you'd like to send to the output file. For example,
=# SELECT * FROM [TABLE];
In summary:
\a
\t
\o /tmp/outputfile.txt
SELECT ......
\o
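As an alternative, if you only need the results of a single query, the \copy meta-command can write them out in one step without toggling output modes (the query-in-parentheses form requires PostgreSQL 8.2 or later; the table and path here are placeholders):

```
=# \copy (SELECT * FROM [TABLE]) TO '/tmp/outputfile.csv' CSV HEADER
```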
Monday, August 07, 2006
Personalized Marketing
Personalized marketing is a four-phase process:
identifying potential customers
determining their needs and their lifetime value to the company
interacting with customers so as to learn about them
customizing products, services, and communications to individual customers
From Wikipedia, “Personalized marketing,” (Cited: Peppers, D. and Rogers, M. 1993)
Wednesday, June 28, 2006
Dimensionality Reduction Notes
Principal Components Analysis (PCA)
How do you choose how many and which eigenvalues/eigenvectors to use?
Kaiser Criterion
This criterion says to retain only factors with eigenvalues greater than 1. In other words, if a factor does not extract at least as much variance as a single original variable, it is discarded. It is named after Kaiser, who proposed it in 1960, and it seems to be used quite frequently.
The Scree Test
This is a graphical test used to decide how many factors to keep. To perform this test, first plot the eigenvalues in decreasing order. Next, Cattell suggests finding the place where the smooth decrease of eigenvalues appears to level off (to the right), similar to geological scree (the loose rock debris at the bottom of a rocky slope).
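The Kaiser criterion is easy to try out numerically. Here is a small illustration on synthetic data (my own example, not from the text): two pairs of highly correlated variables, so we expect two factors to survive the eigenvalue > 1 cutoff.

```python
import numpy as np

# Synthetic data: four variables, but really only two underlying factors,
# since variables 2 and 4 are near-copies of variables 1 and 3.
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
data = np.column_stack([
    x1,
    x1 + 0.1 * rng.normal(size=n),  # nearly a copy of x1
    x3,
    x3 + 0.1 * rng.normal(size=n),  # nearly a copy of x3
])

# Eigenvalues of the correlation matrix, in decreasing order.
corr = np.corrcoef(data, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: keep only factors with eigenvalue > 1.
retained = int(np.sum(eigenvalues > 1))
print(eigenvalues.round(3), "->", retained, "factors retained")
```

Plotting these eigenvalues in decreasing order also gives the scree plot: the sharp drop after the second eigenvalue is the "elbow" Cattell's test looks for.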
Here are some other useful terms and definitions from the dictionary:
Multicollinearity refers to linear inter-correlation among variables. Simply put, if nominally "different" measures actually quantify the same phenomenon to a significant degree -- i.e., wherein the variables are accorded different names and perhaps employ different numeric measurement scales but correlate highly with each other -- they are redundant.
Friday, June 23, 2006
K-Means Clustering
Basic Algorithm
1. Choose k cluster centers at random
2. Assign each point to nearest cluster center
3. Compute the new cluster centers based on the assigned points
4. Repeat until cluster centers converge
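The four steps above can be sketched with NumPy. This is a minimal illustration, not a production implementation; to keep the demo deterministic, the initial centers are passed in explicitly rather than chosen at random (step 1):

```python
import numpy as np

def kmeans(points, centers, max_iter=100):
    """Minimal k-means: assign points to the nearest center, recompute
    centers as cluster means, and repeat until the centers converge."""
    centers = centers.astype(float).copy()
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest cluster center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = np.array([points[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        # Step 4: stop once the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs; seed one initial center in each blob.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
                    rng.normal([10, 10], 0.5, (20, 2))])
centers, labels = kmeans(points, centers=points[[0, 20]])
print(centers.round(2))
```

With random initialization instead, different runs can converge to different local minima, which is exactly the shortcoming noted below.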
Shortcomings
Finds local minima
The random placement of cluster centers affects the outcome
Here is a nice K-Means Demo
Tuesday, January 31, 2006
GEDCOM file format information
The following links discuss how the GEDCOM format is defined:
Cyndi's List
The GEDCOM Standard Release 5.5
GEDCOM: The Next Generation
Friday, January 20, 2006
Networked Data File Types
Here are reference links to common network data file types used in Link Mining and Social Network Analysis:
Pajek Net File
UCINet DL Files and VNA
Friday, December 09, 2005
Machine Learning Topics
Particle Swarm Optimization
wikipedia
Swarm Intelligence
Ant Algorithms
ant colony optimization
Reinforcement Learning
wikipedia
Q-learning
Q-learning definition
Markov decision process
Computational Learning Theory
wikipedia
VC dimension
Principle of maximum entropy
Ensembles, Bagging and Boosting
Boosting
Meta-Learning
METAL KDD
Christophe Giraud-Carrier
HMMs
Hidden Markov model
Saturday, October 29, 2005
Viral Marketing
Dr. Ralph F. Wilson suggests that Viral Marketing comprises the following components:
1. Gives away products or services
2. Provides for effortless transfer to others
3. Scales easily from small to very large
4. Exploits common motivations and behaviors
5. Utilizes existing communication networks
6. Takes advantage of others' resources
The effects of word-of-mouth, or viral, marketing are a motivation for utilizing the social networks that customers belong to.