Selected (Mostly Recent) Publications
November 2008 - Gusfield

Books:


Algorithm on Strings, Trees, and Sequences: Computer Science and Computational Biology
D. Gusfield (Cambridge Press)

The Stable Marriage Problem: Structure and Algorithms
D. Gusfield and R. Irving (MIT Press)

Papers on Haplotype Inference

  1. Analytical and Algorithmic Methods for Haplotype Frequency Inference: What do they tell us?
    S. Orzack, D. Gusfield, L. Subrahmanyan, L. Essioux, S. Lissarrague
    in Bioinformatics Algorithms: Techniques and Applications, I. Mandoiu and A. Zelikovsky (eds.), Wiley 2008, p. 373-394

  2. Algorithms for Imperfect Phylogeny Haplotyping with a Single Homoplasy or Recombination Event (pdf),
    Y. S. Song, Y. Wu and Gusfield
    Proceedings of WABI 2005, October 2005, Lecture Notes in Bioinformatics
    In this paper we look at extensions of the PPH problem (see below) to situations where the underlying haplotypes that form the genotypes do not evolve on a perfect phylogeny, but rather evolve either on a tree with one homoplasy event (a recurrent or back mutation) or on a network with one recombination event. Using one emprically-justified assumption, we present a polynomial time algorithm for the first problem (the case of one homoplasy event) and an exponential-time algorithm for the second problem. We have subsequently developed a polynomial-time algorithm for the second problem as well.

  3. A Linear-Time Algorithm for Perfect Phylogeny Haplotyping (pdf)
    Z. Ding, V. Filkov, D. Gusfield
    This is the expanded, full journal version of the RECOMB conference paper below.
    Journal of Computational Biology, Vol. 13, No. 2, pg. 522-553, 2006

  4. A Linear-Time Algorithm for Perfect Phylogeny Haplotyping (pdf)
    Z. Ding, V. Filkov, D. Gusfield
    Proceedings of RECOMB 2005, May 2005

    click here for postscript version
    This paper solves the open question of whether the Perfect Phylogeny Haplotyping (PPH) Problem can be solved in linear time. The method presented is simple enough for easy implementation, and our implementation is on the web.

  5. Haplotype Inference (pdf)
    D. Gusfield and S.H. Orzack
    In CRC Handbook on Bioinformatics, 2006 (S. Aluru Editor), p. 18-1 18-25

  6. A Note on Efficient Computation of Haplotypes via Perfect Phylogeny
    Vineet Bafna, Dan Gusfield, Sridhar Hannenhalli, Shibu Yooseph
    Journal of Computational Biology, Vol. 11, no. 5, 2004, p. 858-866

    This paper shows that two prior methods for the PPH problem are unlikely to be implementable in linear time. It also shows that the problem of finding a PPH solution that minimizes the number of distinct haplotypes used (over all PPH solutions) is NP-hard.

  7. An Overview of Combinatorial Methods for Haplotype Inference (pdf)
    In Computational Methods for SNPs and Haplotype Inference, S. Istrail, M. W aterman, and A. Clark (eds). Lecture Notes in Computer Science, vol. 2983, Springer, p. 9-25, 2004.

    click here for postscript version
    (2004) A survey of combinatorial algorithms and software
    for haplotype inference developed at UC Davis.
    D. Gusfield

  8. Empirical Exploration of Perfect Phylogeny Haplotyping and Haplotypers (pdf)

    click here for postscript version
    R.H. Chung and D. Gusfield
    Proceedings of the 2003 Cocoon Conference July 2003
    Link to the proceedings
    This reports on the performance of three perfect phylogeny haplotyping programs, and on two interesting phenomena observed when solving the perfect phylogeny haplotyping problem on simulated data.

  9. Haplotyping by Pure Parsimony (pdf)
    click here for postscript version
    Technical Report UCDavis CSE-2003-2
    D. Gusfield
    January 23, 2003,
    Final version in The Proceedings of the 2003 Combinatorial Pattern Matching Conference, June 2003
    Link to conference proceedings

    The Pure Parsimony problem for Haplotype Inference is to find the smallest set of binary strings that can generate an input set of genotypes. The Pure Parsimony problem is NP-hard, and no paper has previously shown how an optimal Pure-Parsimony solution can be computed efficiently for problem instances of the size of current biological interest. In this paper, we show how to formulate the Pure-Parsimony problem as an integer linear program; we explain how to improve the practicality of the integer programming formulation; and we present the results of extensive experimentation we have done to show the time and memory practicality of the method, and to compare its accuracy against solutions found by the widely used general haplotyping program PHASE. The results are that the Pure Parsimony problem can be solved efficiently in practice for a wide range of problem instances of current interest in biology. Both the time needed for a solution, and the accuracy of the solution, depend on the level of recombination in the input strings. The speed of the solution improves with increasing recombination, but the accuracy of the solution decreases with increasing recombination.

  10. Haplotyping as Perfect Phylogeny: A direct approach (pdf).
    Haplotyping as Perfect Phylogeny: A direct approach (postscript)
    Technical Report UCDavis CSE-2002-21
    V. Bafna, D. Gusfield, G. Lancia, S. Yooseph
    July 17, 2002.
    An augmented version has appeared in Journal of Computational Biology, Vol. 10 No. 3 (2003) p. 323-340.

    This develops a more direct, self-contained method (than the first paper in the area: "Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions") determining whether unphased genotype data can be explained by haplotypes that fit a perfect phylogeny. It gives a more intuitive and algorithmicly simpler way to create the representation of all such solutions. An implementation of this method is available at:
    DPPH .

  11. Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions. (pdf)
    Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions (postscript)
    D. Gusfield
    Appeared in RECOMB 2002, April 2002.
    This paper develops an almost-linear-time algorithm to determine if unphased genotype data can be explained by haplotype pairs which fit a rooted perfect phylogeny. If there is such an explanation, then the algorithm, in linear additional time, determines if the solution is unique, and if not, produces a representation of all the solutions. The method is based on reducing the haplotype problem to a problem of graph realization. We use this reduction to graph realization, in the software package GPPH (previously called PPH). However, in GPPH, we use a somewhat slower way to solve the graph realization problem than is described in the paper.

    Errata for: Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions (pdf)
    Errata for: Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions (postscript)


  12. Perfect Phylogeny Haplotyper: Haplotype Inferral Using a Tree Model (pdf)
    Perfect Phylogeny Haplotyper: Haplotype Inferral Using a Tree Model (Postscript)
    R.H. Chung and D. Gusfield.
    This gives a short description of the program PPH (now called GPPH), which implements the solution method discussed in the prior paper and Errata.
    Bioinformatics, Vol. 19, No. 6, pp. 780-781, 2003.

  13. Inference of Haplotypes in Samples of Diploid Populations: Complexity and Algorithms.(pdf)
    Inference of Haplotypes in Samples of Diploid Populations: Complexity and Algorithms (postscript)
    D. Gusfield
    In Journal of Computational Biology, Vol. 8, No. 3, 2001 p. 305-323

  14. The absolute and relative accuracy of haplotype inferral methods and a consensus approach to haplotype inferral. S. H. Orzack, D. Gusfield, V.P. Stanton.
    Abstract in 51'th Annual Meeting of the American Society of Human Genetics, Fall 2001.
    The full, much expanded paper (in Genetics, listed next) details how to modify Clark's well-known and widely used method of haplotype inferral to greatly boost its accuracy, making it nearly competitive in accuracy with PHASE, and vastly faster. This paper is almost unique in the literature in that it compares the accuracy of the methods to laboratory determined haplotype data - determined by one of the authors. It is the joint collaboration of a computer scientist, a geneticist and molecular biologists.

  15. Analysis and Exploration of the Use of Rule-Based Algorithms and Consensus Methods for the Inferral of Haplotypes
    Genetics, Vol. 165, 915-928, October 2003
    Steven Hecht Orzack, Daniel Gusfield, Jeffrey Olson, Steven Nesbitt, Lakshman Subrahmanyanc, and Vincent P. Stanton, Jr.

    Papers on Phylogenetic Networks and ARGs with Recombination

  16. A Decomposition Theory for Phylogenetic Networks and Incompatible Characters
    Dan Gusfield, Vikas Bansal, Vineet Bafna, Yun S. Song
    Journal of Computational Biology Dec 2007, Vol. 14, No. 10: 1247-1272.
    This is a large expansion of the RECOMB 2005 conference paper listed below. It solves the major open question from that conference paper and extends the earlier results.

  17. Algorithms to Distinguish the Role of Gene-Conversion from Single-Crossover Recombination in the Derivation of SNP Sequences in Populations
    Yun S. Song, Zhihong Ding, Dan Gusfield, Charles H. Langley, Yufeng Wu
    Journal of Computational Biology Dec 2007, Vol. 14, No. 10: 1273-1286
    This is the extended journal version of the RECOMB 2006 conference paper listed below.

  18. A new recombination lower bound and the minimum perfect phylogenetic forest problem
    Y. Wu and D. Gusfield
    LNCS, vol. 4598, Proceedings of the 13'th Annual International Conference on Combinatorics and Computing, 2007, p. 16-26

  19. Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants
    Y. Wu and D. Gusfield
    LNCS, vol. 4580, Proceedings of the 18'th Annual Symposium on Combinatorial Pattern Matching, 2007, p. 150-161

  20. An efficiently-computed lower bound on the number of recombinations in phylogenetic networks: Theory and empirical study
    link to Science Direct for the final journal manuscript
    D. Gusfield, D. Hickerson, S. Eddhu
    Discrete Applied Mathematics, Special issue on Computational Biology, Vol. 155, pg. 806-830, 2007

    preprint in pdf

  21. Efficient Computation of Minimum Recombination with Genotypes (not Haplotypes) (pdf)
    Y.Wu and D. Gusfield
    In Proceedings of The CSB Conference, Stanford CA., August 2006.
    The journal version appeared in J. Bioinformatics and Computational Biology, Vol. 5 No. 2(a), 2007, p. 181-200

  22. Algorithms to distinguish the role of gene-conversion from single-crossover recombination in the derivations of SNP sequences in populations (pdf)
    Y.S. Song, Z. Ding, D. Gusfield, C. Langley, Y. Wu
    In Proceedings of RECOMB 2006.
    ps version

  23. Efficient computation of close lower and upper bounds on the minimum number of recombinations in biological sequence evolution
    Yun S. Song, Yufeng Wu and Dan Gusfield
    Proceedings of ISMB 2005, published as a special issue of Bioinformatics

    For the associated software
    HapBound and SHRUB

  24. On the Full-Decomposition Optimality Conjecture for Phylogenetic Networks (pdf)

    D. Gusfield
    UCD Technical Report, Jan. 2005

  25. A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters (pdf)
    click here for postscript version
    D. Gusfield and V. Bansal
    Proceedings of RECOMB 2005, May 2005

  26. Optimal, Efficient Reconstruction of Root-Unknown Phylogenetic Networks with Constrained and Structured Recombination (pdf)
    click here for postscript version
    Technical Report UCDavis CSE-2004-31
    D. Gusfield
    November 25, 2004,
    Journal version appears in J. Computer and Systems Sciences, 2005 Special issue on Computational Biology.
    Vol. 70 (2005) p. 381-398

  27. A fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters (pdf)
    click here for postscript version
    Technical Report UCDavis CSE-2004-32
    D. Gusfield
    October 14, 2004,

  28. Optimal, Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination (pdf)
    Click here for Postscript
    D. Gusfield, S. Eddhu, C. Langley
    Journal of Bioinformatics and Computational Biology, Vol. 2 no. 1 (2004) p. 173-213
    Special Issue on the 2003 IEEE CSB Bioinformatics Conference.

    This is a much expanded journal version of the conference paper listed next.

  29. Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination (pdf)
    Click here for Postscript
    D. Gusfield, S. Eddhu, C. Langley
    Proceedings of the 2003 IEEE CSB Bioinformatics Conference. This won the best-paper award.
    Thank you Hewlett Packard for donating the prize money.

  30. The Fine Structure of Galls in Phylogenetic Networks (pdf)
    D. Gusfield, S. Eddhu, C. Langley
    INFORMS J. of Computing Special Issue on Computational Biology (2004).

    Other Recent Papers and Proceedings in Computational Biology and Bioinformatics

  31. Integer Programming Formulations and Computations Solving Phylogenetic and Population Genetic Problems with Missing or Genotypic Data (pdf)
    D. Gusfield and Y. Frid and D. Brown
    LNCS, vol. 4598, Proceedings of the 13'th Annual International Conference on Combinatorics and Computing, 2007, p. 51-64

  32. Linear-Time Algorithms for Finding and Representing all Tandem Repeats in a String (pdf)
    D. Gusfield and J. Stoye
    JCSS, vol. 69, 2004, p. 525-546
    We show how to decorate a suffix tree for a string with the endpoints of all the distinct substrings of the string that are tandem repeats. The algorithm runs in linear-time as a function of the length of the string. Of course, such an algorithmic result would not be possible without the remarkable mathematical result, proven by A. Frankel and J. Simpson in 1998, that there can only be a linear number of distinct substrings that are tandem repeats. The actual bound is 2n, for a string of length n. After the suffix tree is decorated, almost all imaginable questions about the tandem repeats or tandem arrays (how many occurrences there are, where they occurr, their average length, the maximum and minimum lengths from given positions etc.) that were previously studied individually, can be answered efficiently using this decorated suffix tree.

  33. RECOMB 2004, Proceedings of the Eighth Annual International Conference on Computational Molecular Biology
    ACM Press
    D. Gusfield (Program Chair), P. Bourne, S. Istrail, P. Pevzner, M. Waterman (editors)

  34. Algorithms in Bioinformatics
    Proceedings of the Second International Workshop on Algorithms in Bioinformatics, WABI 2002, Rome, Italy, September 2002
    Springer Lecture Notes in Computer Science Vol. 2452, 554 pages
    R. Guigo and D. Gusfield (editors)

  35. String Barcoding: Uncovering Optimal Virus Signatures (pdf)
    S. Rash and D. Gusfield, Appeared in RECOMB 2002, April 2002

  36. On the Complexity of Fundamental Computational Problems in Pedigree Analysis (pdf)
    On the Complexity of Fundamental Computational Problems in Pedigree Analysis (postscript)
    A. Piccolboni and Dan Gusfield
    An improved version has appeared in Journal of Computational Biology, Vol 10, No. 5, 2003, p. 763-774.

  37. Partition-distance: A problem and class of perfect graphs arising in clustering (pdf),
    Partition-distance: A problem and class of perfect graphs arising in clustering (postscript),
    Dan Gusfield
    Information Processing Letters, Volume 82, Issue 3, 16 May 2002, Pages 159-164

  38. Simple and flexible detection of contiguous repeats using a suffix tree (pdf) ,
    Simple and flexible detection of contiguous repeats using a suffix tree (postscript) ,
    Jens Stoye and Dan Gusfield
    Theoretical Computer Science, Volume 270, Issues 1-2, 6 January 2002, Pages 843-856

    Other Recent Papers

  39. The structure and complexity of sports elimination numbers (pdf)
    The structure and complexity of sports elimination numbers (postscript)
    Dan Gusfield and Chip Martel,
    In Algorithmica (2002) 32, pages 73-86


Hard to Find Older Papers

  1. Efficient Algorithms for Inferring Evolutionary Trees (pdf)
    Efficient Algorithms for Inferring Evolutionary Trees (ps)
    D. Gusfield

    I get many requests for this paper from people who can't find it in their libraries. It was published in Networks, Vol. 21 (1991) p. 19-28. Networks is no longer a major journal (judging by the libraries that don't receive it, and I guess even by 1991, it was not widely received.). So I am posting the version of it that I found in my files (after much searching). It may not be absolutely identical to the published paper, but it has all the results. The paper has three results: How to find a perfect phylogeny (if there is one) in linear time; a lower bound proof that every entry in the data has to be examined; how to solve the tree compatibility problem in linear time. The first result is also obtained in a somewhat nicer way in my book Algorithms on Strings, Trees and Sequences.

  2. A little knowledge goes a long way: Faster detection of compromised data in 2-D tables(pdf)
    D. Gusfield

    This paper is buried in the Proceedings of the 1990 Oakland conference on Research in Security and Privacy. I don't think anyone knows it is there - I should have published it in a journal.

    The paper considers the following problem: in a 2-D table of non-negative numbers, say n by n, where all the row totals and column totals are known, but only some of the n^2 cell values are known, one wants to determine, for each cell whose value is not known, what is the maximum possible value that the cell could take on. That is, if cell (i,j) has an unknown value, what is the maximum value for cell (i,j) so that there is a setting of all the other unknown (non-negative) values that makes all the rows and columns sum to their correct, fixed values? In this paper I show that no matter how many cell values are unknown, one needs to solve this problem only 2n-1 times. More specifically, I give an algorithm that selects at most 2n-1 of the cells, so that after the maximum value is determined for each of those 2n-1 cells, the maximum value can be obtained for each of the remaining unknown values by doing a single lookup in a table built during the solving of the first 2n-1 problems. So, the maximum value can be determined in constant time for each of the remaining cells. The total time for the algorithm is the time needed for 4n-2 network flow computations in a directed graph of 2n nodes.

  3. New Uses for Uniform Lifted Alignments (pdf)
    D. Gusfield and L. Wang.
    In Mathematical Support for Molecular Biology.
    DIMACS Series on Discrete Math and Theoretical Computer Science, VOL. 47, p. 33-52 (1999)

    This paper is obscure due to a terrible choice of titles. It should have been called something like:

    How to Compute Evolutionary History Really Badly, and Why I Want to.

    The paper uses approximation algorithms in a way that is backwards from what they were designed for, in order to establish bounds on the accuracy of certain computations, rather than trying to find good solutions. The key to doing this is to make an approximation algorithm as inaccurate as possible, while still producing a solution that falls within the worst-case approximation bound. One reviewer of this paper wrote ``This is the dumbest idea I have heard in a long time". We use this approach to try to show that a particular tree alignment of RNA sequences that David Sankoff and Robert Cedergren constructed in 1983 is close to the optimal alignment (under a given objective function). A simple analysis showed that the cost of their alignment is no more than 62% larger than the optimal. In this paper we show that its cost is no more than 16.57% larger than the optimal. I think this idea of using an approximation algorithm backwards can be applied in other problems, but I have never seen the idea picked up by anyone else.

  4. Efficient Methods for Multiple Sequence Alignment with Guaranteed Error Bounds
    Bulletin of Mathematical Biology, Vol. 55, p. 141-154, 1993


    A pre-publication version with all the main results, but a bit more raw than the published journal version.

    This was the first use of a bounded approximation algorithm in computational biology. The particular multiple alignment method developed in the paper was only for the purpose of being able to prove a guaranteed bound on the error, and hence the result in the paper is mainly theoretical. However, it has been reported that the actual alignments are useful in some applications.

    A historical trivia note: This is the first (oldest) paper PubMed indexed using the search term Computational Biology. PubMed query
Return to Gusfield Homepage
November 2008