Dan Gusfield

Professor

Ph.D., UC Berkeley, 1980

My primary interests involve the efficiency of algorithms, particularly for problems in combinatorial optimization and graph theory. These algorithms have been applied to study data security, stable matching, network flow, matroid optimization, string/pattern matching problems, molecular sequence analysis, and optimization problems in population-scale genomics. Currently, I am focused on string and combinatorial problems that arise in computational biology and bioinformatics. I served as chair of the computer science department at UCD from July 2000 until August 2004, and am now the founding Editor-in-Chief of The IEEE/ACM Transactions of Computational Biology and Bioinformatics.

Introduction to The IEEE/ACM Transactions of Computational Biology and Bioinformatics

      Office: 2125 Kemper Hall (Engineering II)
      Phone: (530) 752-7131
      E-mail: gusfield@cs.ucdavis.edu

Please Note: the correct URL for this page is: http://wwwcsif.cs.ucdavis.edu/~gusfield NOT : http://wwwcsif.cs.ucdavis.edu/gusfield (note the missing  ~)  Please bookmark this page for later reference.

Winter 2008 - The website for the class ECS 30 Winter 2008 is www.cs.ucdavis.edu/~gusfield/cs30w08
ECS 30 Winter 2008

Since Spring 2000 I have been teaching an undergraduate course CS 124 Theory and Practice of Bioinformatics. See The Expanded Syllabus for CS124 for more information.
The current (Spring 2008) class webpage .

Bioinformatics Course Lecture Videos:

CS124 Lecture Videos

These are lecture videos for CS 124 taken in 2002. For the topics covered, they are still surprisingly current. Clicking will download a word document with web links to the lectures in both Windows Media player and Quicktime formats. Synopses of the lectures can be found at: Synopses of Bioinformatics Lecture Videos

Graduate Algorithms Course Lecture Videos:

CS222A Lecture Videos

These are lecture videos for CS 222A taken in fall 2007. This is the required graduate course at UC Davis Computer Science on Design and Analysis of Efficient Computer Algorithms.

The First Storer Symposium on Networks Biology and Computation, September 25, 2006

September 25, 2006
10:00am - 6:00pm
Kleiber Hall, UCD Campus

The Symposium will cover a wide range of topics in Networks Biology from issues of acquiring the data needed to characterize biological networks, to network modeling, to computational analysis and comparison of the networks, to biological insights derived from those analyses. The symposium will end with a round-table discussion on the challenges, prospects and future directions of the field. The symposium is supported by the Storer Endowment at UCD and is free and open to all.
For more information, see Symposium on Networks Biology and Computation

The Storer Symposium is in support of the UCD initiative on Computational Networks Biology. For more information on the initiative, see Computational Networks Biology Initiative at UC Davis

Some Recent Publications

For selected recent publications: recent publications

Some Errata for Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology can be found in Errata.

Some Recent Talks

For powerpoint of some recent talks: some recent talks

Software

Program CCHI, Clark Consensus Haplotype Inference, June 2007: A full implementation of Clark Consensus Haplotype Inference Method

Integer Programming Formulations for Phylogenetic and Population Genetic Problems on Haplotypic and Genotypic Data.

June 9, 2006
We have posted twelve Perl programs that produce integer programming formulations (that can be input in lp format and solved by Cplex or GLPK) for seven problems that arise in phylogenetics and population genetics. These problems and the ideas behind their ILP solution are discussed in the paper: ``Integer Programming Formulations and Computations Solving Phylogenetic and Population Genetic Problems with Missing or Genotypic Data"
D. Gusfield and Y. Frid and D. Brown
LNCS, vol. 4598, Proceedings of the 13'th Annual International Conference on Combinatorics and Computing, 2007, p. 51-64

Problem M1: Given a binary matrix M with some entries missing, fill in (impute) the missing values so as to **minimize** the number of incompatible pairs of sites in the resulting matrix M'.

Programs misstest.pl and bbmisstest.pl create the ILP formulations to solve this problem. Both ILP formulations solve very rapidly over a wide range of data, but the formulation created by bbmisstest.pl solves faster.

Problem HM1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to minimize the number of incompatible pairs of sites in the resulting haplotype matrix M'.
Programs hmisstest.pl and hbbmisstest.pl create the ILP formulations to solve this problem. Both ILP formulations solve reasonably rapidly over a wide range of data, but the formulation created by hbbmisstest.pl solves faster.

Problem TR1: Given a binary matrix M with no entries missing, where some pairs of columns are not compatible (i.e., contain all four gametes), select the **minimum** number of columns to remove, so that there are no pairs of columns that are incompatible in the resulting matrix M'.
Program mcontest.pl creates the ILP formulations to solve this problem by CPLEX or GLPK (or other ILP solvers).

Problem HR1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to **minimize** the number of columns that have to be removed in order that no incompatible pairs of sites remain in the resulting haplotype matrix M'.
Program hmcontest.pl creates the ILP formulations to solve this problem by CPLEX or GLPK (or other ILP solvers).

Problem HKM1: Given a binary matrix M with some entries missing, fill in (impute) the missing values so as to **minimize** the resulting Hudson-Kaplan lower bound on the number of recombinations needed to generate the binary sequences, under the infinite sites model of mutations.

Programs misstesthk.pl and bbmisstesthk.pl create the ILP formulations to solve this problem. Both ILP formulations solve very rapidly over a wide range of data, but the formulation created by bbmisstesthk.pl solves faster.

Problem HKHM1: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so as to minimize the resulting Hudson-Kaplan lower bound on the number of recombinations needed to generate the binary sequences, under the infinite sites model of mutations.
Programs hmisstesthk.pl and bbhmisstesthk.pl create the ILP formulations to solve this problem. Both ILP formulations solve reasonably rapidly over a wide range of data, but the formulation created by hbbmisstesthk.pl solves faster.

Problem MinPPH: Given a ternary matrix M representing the genotypes of n individuals, find a pair of haplotypes for each of the n individuals that explain the genotypes (that is, phase the unphased genotype data), so that the deduced haplotypes fit a perfect phylogeny (that is, have no pairs in conflict), and minimize the number of distinct haplotypes used (i.e., over all solutions to the phasing problem that fit a perfect phylogeny).
Programs minpph.pl and bbminpph.pl create the ILP formulations to solve this problem. Both produce formulations that solve reasonably quickly, but the formulation created by bbminpph.pl solves much faster than the one created by minpph.pl.

To download any of these programs, click on:
misstest.pl
bbmisstest.pl
hmisstest.pl
hbbmisstest.pl
mcontest.pl
hmcontest.pl
misstesthk.pl
bbmisstesthk.pl
hmisstesthk.pl
bbhmisstesthk.pl
minpph.pl
bbminpph.pl
example input file data1g that can be used as input by any of these programs

Integer Programming Formulations To Solve Maximum Parsimony Haplotyping

The following two programs create integer programs (solvable by Cplex or GLPK) that find haplotype pairs for a set of input genotypes (that is, the haplotypes phase the genotypes), **minimizing** the number of distinct haplotypes used in the solution. The first program, minthap.pl, implements the ILP formulation called RTIP, generating an integer program whose solution minimizes the **total** number of distinct haplotypes used, including haplotypes obtained from homozygous genotypes or single site ambiguous genotypes in the input set of genotypes, even if those haplotypes are not used to resolve any of the multi-site ambiguous genotypes in the input. The second program, mindhap.pl, generates an integer program whose solution minimizes the number of distinct haplotypes needed to phase just the ambiguous genotypes in the input.
The methods underlying these programs are described in the paper ``Haplotyping by Pure Parsimony", which can be obtained from the recent papers page.

minthap.pl
mindhap.pl

Phylogenetic Networks with Recombination (Ancestral Recombination Graphs)

March 28, 2006
We have posted three sets of programs that concern problems of deriving and analyzing Phylogenetic Networks. The first set of programs, HapBound-GC and SHRUB-GC, concern unconstrained recombination with both crossovers and gene-conversions allowed. The second set, HapBound and SHRUB, concern unconstrained recombination with only crossovers allowed. The third set of programs concern Galled Trees, which are phylogenetic networks where the recombination cycles are constrained to be disjoint.

HapBound-GC and SHRUB-GC


March, 2006
HapBound-GC computes lower bounds on the number of single-crossover recombinations and gene-conversions needed to generate a set of binary sequences under the infinite sites assumption (e.g., one mutation per site). SHRUB-GC constructs an actual network (ARG) that derives the set of sequences, minimizing the total number of crossover and conversion events (when run to completion). See the RECOMB 06 paper on the list of recent papers for an explanation of the programs and thier underlying ideas.
To download HapBound-GC and SHRUB-GC click on
HapBound-GC and SHRUB-GC

HapBound and SHRUB


April 2005
HapBound computes lower bounds on the number of recombinations needed to generate a set of binary sequences under the infinite sites assumption (e.g., one mutation per site). SHRUB constructs an actual network (ARG) that derives the set of sequences. We have found through extensive evaluation that the lower bounds computed by HapBound are close (often equal) to the number of recombinations used by SHRUB. See the ISMB 2005 paper by Song, Wu and Gusfield on the recent papers list for more on this.
To download HapBound and SHRUB click on
HapBound and SHRUB

Galled Tree Programs

October 12, 2004
We are now beginning to release several programs concerned with the reconstruction of phylogenetic network with constrained and unconstrained recombination, given a set of binary sequences. The first two programs determine if the input sequences can be reconstructed with a galled-tree network, i.e., a phylogenetic network where all the cycles caused by recombinations are non-intersecting (node disjoint). One version allows only single crossover recombination, while the other program allows multiple crossover recombination. When an ancestral sequence has been specified in advance, and there is a galled-tree for the input sequences using that ancestral sequence, the galled-tree produced by the program is guaranteed to use the minimum number of recombination nodes (events) over all possible phylogenetic networks that use the specified ancestral sequence and that generate the input sequences. When no ancestral sequence is specified in advance, and there is a galled-tree for the input sequences using some (program determined) ancestral sequence, the galled-tree produced is guaranteed to use the minimum number of recombination nodes over all possible phylogenetic networks that generate the input sequences, and all possible choices of ancestral sequence. More details are several papers available from the recent papers page, and in the README file for the software.

To download a tar file with the programs and associated files, click on
galledtree programs

Perfect Phylogeny Haplotyping

We have four related software packages for a question of current interest concerning haplotype inference:

LPPH - A Linear-Time Algorithm for Perfect Phylogeny Haplotyping (NEW, posted May 12, 2005)

GPPH (formerly called PPH) - Perfect Phylogeny Haplotyper based on graph realization

DPPH - Perfect Phylogeny Haplotyper: a direct method

BPPH - Perfect Phylogeny Haplotyper: The Berkeley method

These programs take in unphased, SNP genotype data, and determine whether that data can be explained by pairs of haplotypes (phased data) which satisfy either the three or the four-gamete test, as specified by the user. That is, whether there are pairs of haplotypes that explain the genotype data, and can fit a rooted or an unrooted tree, where each change of state occurs only once in the tree. If there is such a set of haplotype pairs that explain the genotype data, the program produces one set, along with the tree that it fits, and determines if that set is the unique solution to the problem. All three programs produce a count of the number of solutions, and program DPPH also produces a representation of all of them. Program LPPH is the fastest of the four programs.

For background on this problem, and the two methods, see the papers with "Perfect Phylogeny" in the title in: recent papers or see the specific references on the pages linked above.

As a by-product of writing the program GPPH, we also have a stand-alone program for solving the graph-realization problem. The graph realization problem is equivalent to determining if a binary matroid is graphic. To download a copy:
Graph Realization

String Algorithms; Sequence Alignment and Analysis

We also have three software packages that implement string and sequence analysis algorithms in computational biology and bioinformatics: XPARAL and Strmat and Seqio

XPARAL is an X-Windows based program which computes a parameterized alignment of two sequences, displaying all of the optimal alignments for the two sequences when the costs of substitutions, gaps and insertions or deletions are varied. The user provides as input two sequences, an alignment type (global, local, end-gap free and substring), and selects which two of the alignment costs to vary. The program then displays the upper right quadrant of a two-dimensional graph, where the graph axes specify the possible values of the varying costs. The quadrant is partitioned into convex polygons, where each polygon specifies an alignment that is optimal over the range of costs within the polygon. The user is then able to view and print out those optimal alignments.

Strmat consists of a simple menu system and source C code implementing many exact matching methods, particularly those based on the Z algorithm and on suffix trees.


Seqio-1.2.2 is a bioinformatics/database format conversion package (version 1.2.2) written by Jim Knight in 1996 when he was a postdoc here at UCD. There has been no further development since then on the version available here.
 
 




Return to Gusfield's Departmental page

June, 2006