The size of electronic data is currently growing at a faster rate than computer memory and disk storage capacities. For this reason compression appears always as an attractive choice, if not mandatory. However, space overhead is not the only resource to be optimized when managing large data collections; in fact data turn out to be useful only when properly indexed to support search operations that efficiently extract the user-requested information. Approaches to combine compression and indexing techniques are nowadays receiving more and more attention. A first step towards the design of a compressed full-text index achieving guaranteed performance in the worst case has been recently done in [P. Ferragina, G. Manzini, in: Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, Redondo Beach, CA, 2000, p. 390]. The novelty of the approach resides in the careful combination of the compression algorithm proposed by M. Burrows and D. Wheeler [A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation, 1994] with the suffix array data structure [U. Manber, G. Myers, SIAM J. Comput. 22 (5) (1993) p. 935]. The resulting index is opportunistic in that, although no assumption on a particular fixed distribution is made, it takes advantage of the compressibility of the input data by decreasing the space occupancy at no significant asymptotic slowdown in the query performance. In this paper we present an implementation of that index and perform an extensive set of experiments on various text collections. These experiments show that the proposed index is compact (its space occupancy is close to the one achieved by the best known compressors), it is fast in counting the number of pattern occurrences, and the cost of their retrieval is reasonable when they are few (i.e., in case of a selective query). In addition, our experiments show that the index is flexible in that it is possible to trade space occupancy for search time by choosing the amount of auxiliary information stored into it.

An Experimental Study of a Compressed Index

MANZINI, Giovanni
2001-01-01

Abstract

The size of electronic data is currently growing at a faster rate than computer memory and disk storage capacities. For this reason compression appears always as an attractive choice, if not mandatory. However, space overhead is not the only resource to be optimized when managing large data collections; in fact data turn out to be useful only when properly indexed to support search operations that efficiently extract the user-requested information. Approaches to combine compression and indexing techniques are nowadays receiving more and more attention. A first step towards the design of a compressed full-text index achieving guaranteed performance in the worst case has been recently done in [P. Ferragina, G. Manzini, in: Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, Redondo Beach, CA, 2000, p. 390]. The novelty of the approach resides in the careful combination of the compression algorithm proposed by M. Burrows and D. Wheeler [A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation, 1994] with the suffix array data structure [U. Manber, G. Myers, SIAM J. Comput. 22 (5) (1993) p. 935]. The resulting index is opportunistic in that, although no assumption on a particular fixed distribution is made, it takes advantage of the compressibility of the input data by decreasing the space occupancy at no significant asymptotic slowdown in the query performance. In this paper we present an implementation of that index and perform an extensive set of experiments on various text collections. These experiments show that the proposed index is compact (its space occupancy is close to the one achieved by the best known compressors), it is fast in counting the number of pattern occurrences, and the cost of their retrieval is reasonable when they are few (i.e., in case of a selective query). In addition, our experiments show that the index is flexible in that it is possible to trade space occupancy for search time by choosing the amount of auxiliary information stored into it.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11579/6246
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 33
  • ???jsp.display-item.citation.isi??? 26
social impact