BWT-RLE FM-Index Backward Search

Overview

This project implements the FM-Index Backward Search Algorithm in C to efficiently search for words or phrases in a compressed, run-length encoded (RLE), and Burrows-Wheeler transformed (BWT) record file. The algorithm enables searching without needing to decode the file back to its original form.

File Encoding

The original record file is a plain text format:

[<offset>]<text>[<offset>]<text>[<offset>]<text>...

<offset>: Integer values serving as unique record identifiers.
<text>: Text values.

After BWT transformation, the file is run-length encoded to save storage space. For example, the BWT transformed text:

aAAbbbBBBBcccccCCCCCCdDDeeeEEEE

is abstracted to:

aAAb{0}B{1}c{2}C{3}dDDe{0}E{1}

Counts represent runs of characters, starting from a minimum of 3 (encoded as 0).
The first bit of each byte indicates if it represents an ASCII character (0) or a count (1).
Multi-byte counts are encoded using 7-bit blocks, starting from the least significant portion.

Example encoding:

01100001 10010011 10000001

represents:

a{147}

The encoded file is called the RLB file (.rlb).

Constraints

Index file is assumed to be persistent, and RLB files have unique names.
Search is case-sensitive, with search terms up to 512 characters.
Text includes visible ASCII characters (32-126), tab (9), and newline (10, 13).
Record offsets are positive, consecutive integers, but not necessarily starting from 0 or 1.
Record texts should not contain square brackets.
Record lengths must not exceed 5000 characters.
Query string must not include only numbers, square brackets, spaces, and double dashes.
RLB files must not be generated from source text files larger than 160MB.
Runtime memory usage must not exceed 13MB.
Index files must not exceed the size of RLB files.

General Usage

To compile the program, use make or:

gcc -O3 -o bwtsearch bwtsearch.c

To run the program, use:

./bwtsearch rlb_file idx_file "search_term"

rlb_file: Path to the BWT-RLE encoded file (.rlb).
idx_file: Path to the index file (.idx).
search_term: Quoted query string.

The program performs a backward search on the BWT-RLE file and outputs all matching records containing the query string to the standard output. If the index file does not exist, the program takes extra time to generate one.

To test the program with sample files in the data folder, use:

./bwtsearch data/small1.rlb data/small1.idx "ban"

Sample output:

[1]ban
[2]banana
[3]band
[4]bandage
Number of matches: 4

The output is sorted by the record identifiers in ascending numerical order and contains no duplicates.

Test Cases

You may try the following search terms:

Lite Test:

"http://opus.kobv.de/tuberlin/volltexte"
"Berlin"
"Institute"
"Technical"
"Munich"
"University"

Intensive Test:

".pdf"
"uni-"
"d-nb"
".info"
".de/"

Special Case Test:

"ewf ewkfhwke ewhh"
";"

Performance

In test cases with 5,000 matches, the program takes approximately 1-2 seconds on an SSD-equipped machine.
In test cases with 9,000 matches, the program takes about 5-8 seconds on an SSD-equipped machine.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data		data
Makefile		Makefile
README.md		README.md
bwtsearch.c		bwtsearch.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

BWT-RLE FM-Index Backward Search

Overview

File Encoding

Constraints

General Usage

Test Cases

Performance

About

Uh oh!

Releases

Packages

Languages

denniemok/bwt-rle-backward-search

Folders and files

Latest commit

History

Repository files navigation

BWT-RLE FM-Index Backward Search

Overview

File Encoding

Constraints

General Usage

Test Cases

Performance

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages