Skip to content

This C tool implements the FM-Index Backward Search Algorithm to efficiently display matching records in BWT-RLE files without requiring decoding.

Notifications You must be signed in to change notification settings

denniemok/bwt-rle-backward-search

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

BWT-RLE FM-Index Backward Search

Overview

This project implements the FM-Index Backward Search Algorithm in C to efficiently search for words or phrases in a compressed, run-length encoded (RLE), and Burrows-Wheeler transformed (BWT) record file. The algorithm enables searching without needing to decode the file back to its original form.

File Encoding

The original record file is a plain text format:

[<offset>]<text>[<offset>]<text>[<offset>]<text>...
  • <offset>: Integer values serving as unique record identifiers.
  • <text>: Text values.

After BWT transformation, the file is run-length encoded to save storage space. For example, the BWT transformed text:

aAAbbbBBBBcccccCCCCCCdDDeeeEEEE

is abstracted to:

aAAb{0}B{1}c{2}C{3}dDDe{0}E{1}
  • Counts represent runs of characters, starting from a minimum of 3 (encoded as 0).
  • The first bit of each byte indicates if it represents an ASCII character (0) or a count (1).
  • Multi-byte counts are encoded using 7-bit blocks, starting from the least significant portion.

Example encoding:

01100001 10010011 10000001

represents:

a{147}

The encoded file is called the RLB file (.rlb).

Constraints

  • Index file is assumed to be persistent, and RLB files have unique names.
  • Search is case-sensitive, with search terms up to 512 characters.
  • Text includes visible ASCII characters (32-126), tab (9), and newline (10, 13).
  • Record offsets are positive, consecutive integers, but not necessarily starting from 0 or 1.
  • Record texts should not contain square brackets.
  • Record lengths must not exceed 5000 characters.
  • Query string must not include only numbers, square brackets, spaces, and double dashes.
  • RLB files must not be generated from source text files larger than 160MB.
  • Runtime memory usage must not exceed 13MB.
  • Index files must not exceed the size of RLB files.

General Usage

To compile the program, use make or:

gcc -O3 -o bwtsearch bwtsearch.c

To run the program, use:

./bwtsearch rlb_file idx_file "search_term"
  • rlb_file: Path to the BWT-RLE encoded file (.rlb).
  • idx_file: Path to the index file (.idx).
  • search_term: Quoted query string.

The program performs a backward search on the BWT-RLE file and outputs all matching records containing the query string to the standard output. If the index file does not exist, the program takes extra time to generate one.

To test the program with sample files in the data folder, use:

./bwtsearch data/small1.rlb data/small1.idx "ban"

Sample output:

[1]ban
[2]banana
[3]band
[4]bandage
Number of matches: 4

The output is sorted by the record identifiers in ascending numerical order and contains no duplicates.

Test Cases

You may try the following search terms:

Lite Test:

"http://opus.kobv.de/tuberlin/volltexte"
"Berlin"
"Institute"
"Technical"
"Munich"
"University"

Intensive Test:

".pdf"
"uni-"
"d-nb"
".info"
".de/"

Special Case Test:

"ewf ewkfhwke ewhh"
";"

Performance

  • In test cases with 5,000 matches, the program takes approximately 1-2 seconds on an SSD-equipped machine.
  • In test cases with 9,000 matches, the program takes about 5-8 seconds on an SSD-equipped machine.

About

This C tool implements the FM-Index Backward Search Algorithm to efficiently display matching records in BWT-RLE files without requiring decoding.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published