This project implements the FM-Index Backward Search Algorithm in C to efficiently search for words or phrases in a compressed, run-length encoded (RLE), and Burrows-Wheeler transformed (BWT) record file. The algorithm enables searching without needing to decode the file back to its original form.
The original record file is a plain text format:
[<offset>]<text>[<offset>]<text>[<offset>]<text>...
<offset>: Integer values serving as unique record identifiers.<text>: Text values.
After BWT transformation, the file is run-length encoded to save storage space. For example, the BWT transformed text:
aAAbbbBBBBcccccCCCCCCdDDeeeEEEE
is abstracted to:
aAAb{0}B{1}c{2}C{3}dDDe{0}E{1}
- Counts represent runs of characters, starting from a minimum of 3 (encoded as 0).
- The first bit of each byte indicates if it represents an ASCII character (0) or a count (1).
- Multi-byte counts are encoded using 7-bit blocks, starting from the least significant portion.
Example encoding:
01100001 10010011 10000001
represents:
a{147}
The encoded file is called the RLB file (.rlb).
- Index file is assumed to be persistent, and RLB files have unique names.
- Search is case-sensitive, with search terms up to 512 characters.
- Text includes visible ASCII characters (32-126), tab (9), and newline (10, 13).
- Record offsets are positive, consecutive integers, but not necessarily starting from 0 or 1.
- Record texts should not contain square brackets.
- Record lengths must not exceed 5000 characters.
- Query string must not include only numbers, square brackets, spaces, and double dashes.
- RLB files must not be generated from source text files larger than 160MB.
- Runtime memory usage must not exceed 13MB.
- Index files must not exceed the size of RLB files.
To compile the program, use make or:
gcc -O3 -o bwtsearch bwtsearch.cTo run the program, use:
./bwtsearch rlb_file idx_file "search_term"rlb_file: Path to the BWT-RLE encoded file (.rlb).idx_file: Path to the index file (.idx).search_term: Quoted query string.
The program performs a backward search on the BWT-RLE file and outputs all matching records containing the query string to the standard output. If the index file does not exist, the program takes extra time to generate one.
To test the program with sample files in the data folder, use:
./bwtsearch data/small1.rlb data/small1.idx "ban"Sample output:
[1]ban
[2]banana
[3]band
[4]bandage
Number of matches: 4The output is sorted by the record identifiers in ascending numerical order and contains no duplicates.
You may try the following search terms:
Lite Test:
"http://opus.kobv.de/tuberlin/volltexte"
"Berlin"
"Institute"
"Technical"
"Munich"
"University"Intensive Test:
".pdf"
"uni-"
"d-nb"
".info"
".de/"Special Case Test:
"ewf ewkfhwke ewhh"
";"- In test cases with 5,000 matches, the program takes approximately 1-2 seconds on an SSD-equipped machine.
- In test cases with 9,000 matches, the program takes about 5-8 seconds on an SSD-equipped machine.