Normal view MARC view ISBD view

Full-text (substring) indexes in external memory

By: Barsky, Marina.

Contributor(s): Stege, Ulrike | Thomo, Alex.

Material type: materialTypeLabel

BookSeries: Synthesis digital library of engineering and computer science: ; Synthesis lectures on data management: # 22.Publisher: San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, c2012Description: 1 electronic text (xii, 76 p.) : ill., digital file.ISBN: 9781608457960 (electronic bk.).Subject(s): Text processing (Computer science) | Data structures (Computer science) | Computer algorithms | Magnetic memory (Computers) | full-text indexes | suffix trees | suffix arrays | external-memory algorithms | string pattern matchingDDC classification: 005 Online resources: Abstract with links to resource Also available in print.

Contents:

Preface -- Acknowledgments --

1. Structures for indexing substrings -- 1.1 Indexing substrings -- 1.2 Suffix array -- 1.3 Suffix trie -- 1.4 Suffix tree -- 1.5 Representation of a suffix tree -- 1.6 Possible solutions to the memory problem -- 1.6.1 Index compression -- 1.6.2 Using disk space -- 1.7 Summary -- 1.8 Bibliographic notes --

2. External construction of suffix trees -- 2.1 Transformation algorithms -- 2.2 Brute-force algorithms -- 2.3 Algorithms based on suffix-links -- 2.3.1 The Ukkonen algorithm -- 2.3.2 Distributed and paged suffix trees -- 2.4 Disk-optimized algorithms -- 2.4.1 The top down algorithm -- 2.4.2 The partition-and-merge algorithm -- 2.4.3 The merge sort algorithm -- 2.5 Summary -- 2.6 Bibliographic notes --

3. Scaling up:when the input exceeds the main memory -- 3.1 The wavefront algorithm -- 3.2 The B2ST algorithm -- 3.3 Summary -- 3.4 Bibliographic notes --

4. Queries for disk-based indexes -- 4.1 Index layouts -- 4.1.1 Partitioning by prefix -- 4.1.2 Partitioning by intervals -- 4.1.3 String B-tree -- 4.2 Pattern matching with disk-based indexes -- 4.2.1 Exact pattern matching -- 4.2.2 Matching all substrings of a query string -- 4.2.3 Approximate pattern matching -- 4.3 Repeating and unique substrings -- 4.3.1 Maximal repeats -- 4.3.2 Common and unique substrings -- 4.4 Summary -- 4.5 Bibliographic notes --

5. Conclusions and open problems -- 5.1 Need for better construction algorithms -- 5.2 Need for better query algorithms --

Bibliography -- Authors' biographies.

Abstract: Nowadays, textual databases are among the most rapidly growing collections of data. Some of these collections contain a new type of data that differs from classical numerical or textual data. These are long sequences of symbols, not divided into well-separated small tokens (words). The most prominent among such collections are databases of biological sequences, which are experiencing today an unprecedented growth rate. Starting in 2008, the "1000 Genomes Project" has been launched with the ultimate goal of collecting sequences of additional 1,500 Human genomes, 500 each of European, African, and East Asian origin. This will produce an extensive catalog of Human genetic variations. The size of just the raw sequences in this catalog would be about 5 terabytes.

average rating: 0.0 (0 votes)

Holdings ( 1 )
Title notes
Comments ( 0 )

Item type	Current location	Call number	Status	Date due	Barcode	Item holds
E books	PK Kelkar Library, IIT Kanpur		Available		EBKE392

Total holds: 0

Mode of access: World Wide Web.

System requirements: Adobe Acrobat Reader.

Part of: Synthesis digital library of engineering and computer science.

Series from website.

Includes bibliographical references (p. 71-74).

Preface -- Acknowledgments --

3. Scaling up:when the input exceeds the main memory -- 3.1 The wavefront algorithm -- 3.2 The B2ST algorithm -- 3.3 Summary -- 3.4 Bibliographic notes --

5. Conclusions and open problems -- 5.1 Need for better construction algorithms -- 5.2 Need for better query algorithms --

Bibliography -- Authors' biographies.

Abstract freely available; full-text restricted to subscribers or individual document purchasers.

Compendex

INSPEC

Google scholar

Google book search

Nowadays, textual databases are among the most rapidly growing collections of data. Some of these collections contain a new type of data that differs from classical numerical or textual data. These are long sequences of symbols, not divided into well-separated small tokens (words). The most prominent among such collections are databases of biological sequences, which are experiencing today an unprecedented growth rate. Starting in 2008, the "1000 Genomes Project" has been launched with the ultimate goal of collecting sequences of additional 1,500 Human genomes, 500 each of European, African, and East Asian origin. This will produce an extensive catalog of Human genetic variations. The size of just the raw sequences in this catalog would be about 5 terabytes.

Also available in print.

Title from PDF t.p. (viewed on January 12, 2012).

There are no comments for this item.

PKK Library

Welcome to P K Kelkar Library, Online Public Access Catalogue (OPAC)

Full-text (substring) indexes in external memory

By: Barsky, Marina.

Contributor(s): Stege, Ulrike | Thomo, Alex.