I have a 40MB (too big for memory in this case) list of strings that I want to do “begins with” queries on to extract matches. Anyone know of a good data structure for this? Bonus points for an existing os java implementation. I would be willing to sacrifice “begins with” to just exact matching if something already exists. A disk-based trie sounds ideal.
Advertisement
Answer
It looks like you need something like this: http://en.wikipedia.org/wiki/Trie
An implementation in Java can be found here, although it isn’t disk-based. I’ll keep searching :/
Useful papers: Trie methods for text and spatial data on secondary storage, B-tries for disk-based string management
Edit: I came across this might might be useful: MG4J: Managing Gigabytes for Java™