Package 'gt4ireval' reference manual

Title:	Generalizability Theory for Information Retrieval Evaluation
Description:	Provides tools to measure the reliability of an Information Retrieval test collection. It allows users to estimate reliability using Generalizability Theory and map those estimates onto well-known indicators such as Kendall tau correlation or sensitivity.
Authors:	Julián Urbano [aut, cre]
Maintainer:	Julián Urbano <[email protected]>
License:	MIT + file LICENSE
Version:	2.0
Built:	2025-03-28 05:05:51 UTC
Source:	https://github.com/julian-urbano/gt4ireval

TREC-3 Ad hoc track.

Description

This is the set of Average Precision scores of the 40 systems submitted to the TREC-3 Ad hoc track, evaluated over 50 topics.

Usage

adhoc3

Format

A data frame with 40 columns (systems) and 50 rows (queries).

References

D. Harman (1994). Overview of the Third Text REtrieval Conference (TREC-3). Text REtrieval Conference.

D-study (Decision)

Description

dstudy runs a D-study from the results of a gstudy and computes, for a certain number of queries, the expected generalizability coefficient Erho2 and index of dependability Phi, possibly with confidence intervals. Alternatively, it can estimate the number of queries needed to achieve a certain level of stability, also with confidence intervals.

Usage

dstudy(gdata, queries = gdata$n.q, stability = 0.95, alpha = 0.025)

Arguments

`gdata`	The result of running a `gstudy` with existing data.
`queries`	A vector with different query set sizes for which to estimate Erho2 and Phi. Defaults to the number of queries used to compute `gdata`.
`stability`	A vector with target Erho2 and Phi values to estimate required query set sizes.
`alpha`	A vector of tail probabilities used to compute confidence intervals for Erho2, Phi and query set sizes. Each value is the probability left on each side of the interval, so for a 90% confidence interval one must set `alpha` to 0.05.
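
Since `alpha` is a per-side probability rather than a confidence level, the conversion is easy to get wrong. The following is a small sketch in base R; the variable `level` is a name introduced here for illustration and is not part of the package.

```r
# Two-sided confidence levels we would like to report (illustrative values)
level <- c(0.90, 0.95, 0.99)

# Probability left on each side of the interval: this is what `alpha` expects
alpha <- (1 - level) / 2
alpha
```

The resulting vector could then be passed as the `alpha` argument of `dstudy`.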

Value

An object of class dstudy, with the following components:

`Erho2`, `Erho2.lwr`, `Erho2.upr`	Expected generalizability coefficient, and lower and upper limits of the intervals around it.
`Phi`, `Phi.lwr`, `Phi.upr`	Expected index of dependability, and lower and upper limits of the intervals around it.
`n.q_Erho2`, `n.q_Erho2.lwr`, `n.q_Erho2.upr`	Expected number of queries to achieve the generalizability coefficient, and lower and upper limits of the intervals around it.
`n.q_Phi`, `n.q_Phi.lwr`, `n.q_Phi.upr`	Expected number of queries to achieve the index of dependability, and lower and upper limits of the intervals around it.
`call`	A list with the `gstudy` used in this D-study, the target number of `queries`, target level of `stability` and `alpha` level for the confidence intervals.
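
To make the point estimates above concrete, here is a minimal sketch of the standard Generalizability Theory formulas for a fully crossed systems x queries design, as given in Brennan (2001). It is not the package's source code: the variance components are made-up values, the function names merely mirror the component names above, and rounding up to a whole number of queries is an assumption of this sketch. Confidence intervals are not covered.

```r
# Hypothetical variance components, such as a G-study might estimate
var.s <- 0.010  # systems
var.q <- 0.030  # queries
var.e <- 0.020  # residual (system-query interaction plus error)

# Expected coefficients when averaging over n.q queries
Erho2 <- function(n.q) var.s / (var.s + var.e / n.q)
Phi   <- function(n.q) var.s / (var.s + (var.q + var.e) / n.q)

# Queries needed to reach a target stability: the formulas above solved for n.q
# (rounding up with ceiling() is an assumption of this sketch)
n.q_Erho2 <- function(target) ceiling(target * var.e / (var.s * (1 - target)))
n.q_Phi   <- function(target) ceiling(target * (var.q + var.e) / (var.s * (1 - target)))

Erho2(c(50, 100))
Phi(c(50, 100))
n.q_Erho2(0.95)
n.q_Phi(0.95)
```

Phi can never exceed Erho2 for the same number of queries, because its error term also includes the query variance; accordingly, reaching a given target with Phi takes at least as many queries as with Erho2.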

Author(s)

Julián Urbano

References

R.L. Brennan (2001). Generalizability Theory. Springer.

L.S. Feldt (1965). The Approximate Sampling Distribution of Kuder-Richardson Reliability Coefficient Twenty. Psychometrika, 30(3):357–370.

C. Arteaga, S. Jeyaratnam, and G. A. Franklin (1982). Confidence Intervals for Proportions of Total Variance in the Two-Way Cross Component of Variance Model. Communications in Statistics: Theory and Methods, 11(15):1643–1658.

J. Urbano, M. Marrero and D. Martín (2013). On the Measurement of Test Collection Reliability. ACM SIGIR, pp. 393-402.

Examples

g <- gstudy(adhoc3)
dstudy(g)

# estimate stability at various query set sizes
dstudy(g, queries = seq(50, 200, 10))
# estimate required query set sizes for various stability levels
dstudy(g, stability = seq(0.8, 0.95, 0.01))
# compute both 95% and 99% confidence intervals
dstudy(g, stability = 0.9, alpha = c(0.05, 0.01) / 2)
# compute 1-tailed 95% confidence intervals
dstudy(g, alpha = 0.05)

G-study (Generalizability)

Description

gstudy runs a G-study with the given data, assuming a fully crossed design (all systems evaluated on the same queries). It can be used to estimate variance components, which can further be used to run a D-study with dstudy.

Usage

gstudy(data, drop = 0)

Arguments

`data`	A data frame or matrix with the existing effectiveness scores. Systems are columns and queries are rows.
`drop`	The fraction of worst-performing systems to drop from the data before analysis. Defaults to 0 (include all systems).
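
As an illustration of what dropping a fraction of systems might look like, the sketch below ranks systems by their mean score and removes the lowest ones. Both the ranking criterion (mean score over queries) and the rounding rule (`floor`) are assumptions made here; the manual does not state how `gstudy` implements `drop`. The data are random and the names `scores`, `n.drop`, `ranked` and `kept` are hypothetical.

```r
set.seed(1)
# Random toy scores: 10 queries (rows) x 5 systems (columns)
scores <- matrix(runif(50), nrow = 10, ncol = 5,
                 dimnames = list(NULL, paste0("sys", 1:5)))

drop <- 0.2
n.drop <- floor(drop * ncol(scores))  # assumption: round down
ranked <- order(colMeans(scores))     # assumption: worst = lowest mean score

# Remove the n.drop worst systems, keeping the matrix shape
kept <- if (n.drop > 0) {
  scores[, -ranked[seq_len(n.drop)], drop = FALSE]
} else {
  scores
}
dim(kept)
```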

Value

An object of class gstudy, with the following components:

`n.s`, `n.q`	Number of systems and number of queries of the existing data.
`var.s`, `var.q`, `var.e`	Variance of the system, query, and residual effects.
`em.s`, `em.q`, `em.e`	Mean squares of the system, query and residual components.
`call`	A list with the existing `data` and the fraction of systems to `drop`.
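
For readers who want to see where such values come from, the following sketch computes mean squares and variance components for a fully crossed design using the textbook two-way ANOVA estimators without replication (Brennan, 2001). It is written in base R with a made-up score matrix and is not the package's implementation; the names `ms.s`, `ms.q` and `ms.e` are chosen here and play the role of the mean squares listed above.

```r
# Toy score matrix: 4 queries (rows) x 3 systems (columns); values are made up
scores <- matrix(c(0.2, 0.4, 0.6,
                   0.1, 0.5, 0.3,
                   0.3, 0.3, 0.9,
                   0.2, 0.6, 0.6),
                 nrow = 4, byrow = TRUE)
n.q <- nrow(scores)
n.s <- ncol(scores)
grand <- mean(scores)

# Mean squares of the system, query and residual effects
ms.s <- n.q * sum((colMeans(scores) - grand)^2) / (n.s - 1)
ms.q <- n.s * sum((rowMeans(scores) - grand)^2) / (n.q - 1)
resid <- scores - outer(rowMeans(scores), colMeans(scores), "+") + grand
ms.e <- sum(resid^2) / ((n.s - 1) * (n.q - 1))

# Variance components from the expected mean squares
var.s <- (ms.s - ms.e) / n.q
var.q <- (ms.q - ms.e) / n.s
var.e <- ms.e
c(var.s = var.s, var.q = var.q, var.e = var.e)
```

With real data these estimators can produce negative values for `var.s` or `var.q`; how such cases are handled is outside the scope of this sketch.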

Author(s)

Julián Urbano

References

R.L. Brennan (2001). Generalizability Theory. Springer.

J. Urbano, M. Marrero and D. Martín (2013). On the Measurement of Test Collection Reliability. ACM SIGIR, pp. 393-402.

Examples

g <- gstudy(adhoc3)

# same, but drop the 20% worst systems
g20 <- gstudy(adhoc3, drop = 0.2)

Map GT-based Indicators onto Data-based Indicators

Description

Maps Erho2 and Phi scores from Generalizability Theory onto traditional data-based scores: the Kendall tau correlation, AP correlation, power, minor conflict rate and major conflict rate with 2-tailed t-tests, absolute and relative sensitivity, and root mean squared error.

Usage

gt2tau(Erho2)

gt2tauAP(Erho2)

gt2power(Erho2)

gt2minor(Erho2)

gt2major(Erho2)

gt2asens(Erho2)

gt2rsens(Phi)

gt2rmse(Phi)

Arguments

`Erho2`	Vector of generalizability coefficients to map from.
`Phi`	Vector of indices of dependability to map from.

Details

Take these mappings with a grain of salt. See Figure 3 in Urbano et al. (2013).

Value

A vector of data-based indicator values.

Author(s)

Julián Urbano

References

J. Urbano, M. Marrero and D. Martín (2013). On the Measurement of Test Collection Reliability. ACM SIGIR, pp. 393-402.

Examples

g <- gstudy(adhoc3)
d <- dstudy(g)
gt2tau(d$Erho2)
gt2rmse(d$Phi)

Synthetic dataset no. 4.

Description

This is the Synthetic dataset no. 4 from Table 3.2 on page 73 of Brennan (2001), recast as a p x i design, as required on page 182.

Usage

synthetic4

Format

A data frame with 10 columns (systems) and 12 rows (queries).

References

R.L. Brennan (2001). Generalizability Theory. Springer.
