Welcome to P K Kelkar Library, Online Public Access Catalogue (OPAC)

Information retrieval models : foundations and relationships /

By: Roelleke, Thomas.
Material type: Book
Series: Synthesis digital library of engineering and computer science ; Synthesis lectures on information concepts, retrieval, and services, # 27
Publisher: San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, c2013
Description: 1 electronic text (xxi, 141 p.) : ill., digital file.
ISBN: 9781627050791 (electronic bk.).
Subject(s): Information retrieval -- Mathematical models | Information Retrieval (IR) Models | Foundations & Relationships | TF-IDF | probability of relevance framework (PRF) | Poisson | BM25 | language modelling (LM) | divergence from randomness (DFR) | probabilistic roots of IR models
DDC classification: 025.04
Online resources: Abstract with links to resource | Abstract with links to full text
Also available in print.
Contents:
1. Introduction -- 1.1 Structure and contribution of this book -- 1.2 Background: a timeline of IR models -- 1.3 Notation -- 1.3.1 The notation issue "term frequency" -- 1.3.2 Notation: Zhai's book and this book --
2. Foundations of IR models -- 2.1 TF-IDF -- 2.1.1 TF variants -- 2.1.2 TFlog: Logarithmic TF -- 2.1.3 TFfrac: fractional (ratio-based) TF -- 2.1.4 IDF variants -- 2.1.5 Term weight and RSV -- 2.1.6 Other TF variants: lifted TF and pivoted TF -- 2.1.7 Semi-subsumed event occurrences: a semantics of the BM25-TF -- 2.1.8 Probabilistic IDF: The probability of being informative -- 2.1.9 Summary -- 2.2 PRF: the probability of relevance framework -- 2.2.1 Feature independence assumption -- 2.2.2 Non-query term assumption -- 2.2.3 Term frequency split -- 2.2.4 Probability ranking principle (PRP) -- 2.2.5 Summary -- 2.3 BIR: binary independence retrieval -- 2.3.1 Term weight and RSV -- 2.3.2 Missing relevance information -- 2.3.3 Variants of the BIR term weight -- 2.3.4 Smooth variants of the BIR term weight -- 2.3.5 RSJ term weight -- 2.3.6 On theoretical arguments for 0.5 in the RSJ term weight -- 2.3.7 Summary -- 2.4 Poisson and 2-Poisson -- 2.4.1 Poisson probability -- 2.4.2 Poisson analogy: sunny days and term occurrences -- 2.4.3 Poisson example: toy data -- 2.4.4 Poisson example: TREC-2 -- 2.4.5 Binomial probability -- 2.4.6 Relationship between Poisson and binomial probability -- 2.4.7 Poisson PRF -- 2.4.8 Term weight and RSV -- 2.4.9 2-Poisson -- 2.4.10 Summary -- 2.5 BM25 -- 2.5.1 BM25-TF -- 2.5.2 BM25-TF and pivoted TF -- 2.5.3 BM25: literature and Wikipedia end 2012 -- 2.5.4 Term weight and RSV -- 2.5.5 Summary -- 2.6 LM: language modeling -- 2.6.1 Probability mixtures -- 2.6.2 Term weight and RSV: LM1 -- 2.6.3 Term weight and RSV: LM (normalized) -- 2.6.4 Term weight and RSV: JM-LM -- 2.6.5 Term weight and RSV: Dirich-LM -- 2.6.6 Term weight and RSV: LM2 -- 2.6.7 Summary -- 2.7 PIN's: probabilistic inference networks -- 2.7.1 The Turtle/Croft link matrix -- 2.7.2 Term weight and RSV -- 2.7.3 Summary -- 2.8 Divergence-based models and DFR -- 2.8.1 DFR: divergence from randomness -- 2.8.2 DFR: sampling over documents and locations -- 2.8.3 DFR: binomial transformation step -- 2.8.4 DFR and KL-divergence -- 2.8.5 Poisson as a model of randomness: P(K_t > 0 | d,c): DFR-1 -- 2.8.6 Poisson as a model of randomness: P(K_t = TF_d | d,c): DFR-2 -- 2.8.7 DFR: elite documents -- 2.8.8 DFR: example -- 2.8.9 Term weights and RSV's -- 2.8.10 KL-divergence retrieval model -- 2.8.11 Summary -- 2.9 Relevance-based models -- 2.9.1 Rocchio's relevance feedback model -- 2.9.2 The PRF -- 2.9.3 Lavrenko's relevance-based language models -- 2.10 Precision and recall -- 2.10.1 Precision and recall: conditional probabilities -- 2.10.2 Averages: total probabilities -- 2.11 Summary --
3. Relationships between IR models -- 3.1 PRF: the probability of relevance framework -- 3.1.1 Estimation of term probabilities -- 3.2 P(d → q): the probability that d implies q -- 3.3 The vector-space model (VSM) -- 3.3.1 VSM and probabilities -- 3.4 The generalised vector-space model (GVSM) -- 3.4.1 GVSM and probabilities -- 3.5 A general matrix framework -- 3.5.1 Term-document matrix -- 3.5.2 On the notation issue "term frequency" -- 3.5.3 Document-document matrix -- 3.5.4 Co-occurrence matrices -- 3.6 A parallel derivation of probabilistic retrieval models -- 3.7 The Poisson bridge: P_D(t|u) avgtf(t,u) = P_L(t|u) avgdl(u) -- 3.8 Query term probability assumptions -- 3.8.1 Query term mixture assumption -- 3.8.2 Query term burstiness assumption -- 3.8.3 Query term BIR assumption -- 3.9 TF-IDF -- 3.9.1 TF-IDF and BIR -- 3.9.2 TF-IDF and Poisson -- 3.9.3 TF-IDF and BM25 -- 3.9.4 TF-IDF and LM -- 3.9.5 TF-IDF and LM: side-by-side -- 3.9.6 TF-IDF and PIN's -- 3.9.7 TF-IDF and divergence -- 3.9.8 TF-IDF and DFR: risk times gain -- 3.9.9 TF-IDF and DFR: gaps between term occurrences -- 3.10 More relationships: BM25 and LM, LM and PIN's -- 3.11 Information theory -- 3.11.1 Entropy -- 3.11.2 Joint entropy -- 3.11.3 Conditional entropy -- 3.11.4 Mutual information (MI) -- 3.11.5 Cross entropy -- 3.11.6 KL-divergence -- 3.11.7 Query clarity: divergence(query || collection) -- 3.11.8 LM = Clarity(query) - Divergence(query || doc) -- 3.11.9 TF-IDF = Clarity(doc) - Divergence(doc || query) -- 3.12 Summary --
4. Summary & research outlook -- 4.1 Summary -- 4.2 Research outlook -- 4.2.1 Retrieval models -- 4.2.2 Evaluation models -- 4.2.3 A unified framework for retrieval and evaluation -- 4.2.4 Model combinations and "new" models -- 4.2.5 Dependence-aware models -- 4.2.6 "Query-log" and other more-evidence models -- 4.2.7 Phase-2 models: retrieval result condensation models -- 4.2.8 A theoretical framework to predict ranking quality -- 4.2.9 MIR: math for IR -- 4.2.10 AIR: abstraction for IR --
Bibliography -- Author's biography -- Index.
Abstract: Information Retrieval (IR) models are a core component of IR research and IR systems. The past decade brought a consolidation of the family of IR models, which by 2000 consisted of relatively isolated views on TF-IDF (Term-Frequency times Inverse-Document-Frequency) as the weighting scheme in the vector-space model (VSM), the probabilistic relevance framework (PRF), the binary independence retrieval (BIR) model, BM25 (Best-Match Version 25, the main instantiation of the PRF/BIR), and language modelling (LM). Also, the early 2000s saw the arrival of divergence from randomness (DFR). Regarding intuition and simplicity, though LM is clear from a probabilistic point of view, several people stated: "It is easy to understand TF-IDF and BM25. For LM, however, we understand the math, but we do not fully understand why it works." This book takes a horizontal approach gathering the foundations of TF-IDF, PRF, BIR, Poisson, BM25, LM, probabilistic inference networks (PIN's), and divergence-based models. The aim is to create a consolidated and balanced view on the main models. A particular focus of this book is on the "relationships between models." This includes an overview over the main frameworks (PRF, logical IR, VSM, generalized VSM) and a pairing of TF-IDF with other models. It becomes evident that TF-IDF and LM measure the same, namely the dependence (overlap) between document and query. The Poisson probability helps to establish probabilistic, non-heuristic roots for TF-IDF, and the Poisson parameter, average term frequency, is a binding link between several retrieval models and model parameters.
Item type: E books
Current location: PK Kelkar Library, IIT Kanpur
Status: Available
Barcode: EBKE510
Total holds: 0

Mode of access: World Wide Web.

System requirements: Adobe Acrobat Reader.

Part of: Synthesis digital library of engineering and computer science.

Series from website.

Includes bibliographical references (p. 127-134) and index.

Abstract freely available; full-text restricted to subscribers or individual document purchasers.

Indexed by: Compendex; INSPEC; Google scholar; Google book search.

Title from PDF t.p. (viewed on August 14, 2013).
