Title: | Generalizability Theory for Information Retrieval Evaluation |
---|---|
Description: | Provides tools to measure the reliability of an Information Retrieval test collection. It allows users to estimate reliability using Generalizability Theory and map those estimates onto well-known indicators such as Kendall tau correlation or sensitivity. |
Authors: | Julián Urbano [aut, cre] |
Maintainer: | Julián Urbano <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.0 |
Built: | 2025-02-26 05:00:41 UTC |
Source: | https://github.com/julian-urbano/gt4ireval |
This is the set of Average Precision scores of the 40 systems submitted to the TREC-3 Ad hoc track, evaluated over 50 topics.
adhoc3
A data frame with 40 columns (systems) and 50 rows (queries).
D. Harman (1994). Overview of the Third Text REtrieval Conference (TREC-3). Text REtrieval Conference.
dstudy runs a D-study from the results of a gstudy and computes, for a certain number of queries, the expected generalizability coefficient Erho2 and index of dependability Phi, possibly with confidence intervals. Alternatively, it can estimate the number of queries needed to achieve a certain level of stability, also with confidence intervals.
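For reference, both coefficients have standard definitions in Generalizability Theory for a fully crossed systems × queries design (Brennan, 2001). The sketch below writes them in terms of the variance components reported by gstudy, with \sigma^2_s, \sigma^2_q and \sigma^2_e standing for var.s, var.q and var.e, and n'_q for the query set size of the D-study. The symbols are notation chosen here for illustration, and it is an assumption that dstudy follows these textbook formulas exactly.

```latex
\begin{align*}
E\rho^2 &= \frac{\sigma^2_s}{\sigma^2_s + \dfrac{\sigma^2_e}{n'_q}}, &
\Phi &= \frac{\sigma^2_s}{\sigma^2_s + \dfrac{\sigma^2_q + \sigma^2_e}{n'_q}}
\end{align*}
```

Both coefficients approach 1 as n'_q grows. Phi never exceeds Erho2 when \sigma^2_q is non-negative, because it also counts the query variance as error.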
dstudy(gdata, queries = gdata$n.q, stability = 0.95, alpha = 0.025)
gdata | The result of running a G-study with gstudy. |
queries | A vector with different query set sizes for which to estimate Erho2 and Phi. Defaults to the number of queries used to compute gdata. |
stability | A vector with target Erho2 and Phi values for which to estimate required query set sizes. |
alpha | A vector of alpha levels used to compute confidence intervals for Erho2, Phi and query set sizes. This is the probability on each side of the interval, so for a 90% confidence interval one must set alpha = 0.05. |
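To make the relation between alpha and the confidence level concrete: because alpha is the probability left in each tail, a two-sided interval has confidence level 1 - 2\alpha. The lines below are plain arithmetic on that relation, not output from the package.

```latex
\begin{align*}
\text{two-sided confidence level} &= 1 - 2\alpha \\
\alpha = 0.025 &\;\Rightarrow\; 1 - 2(0.025) = 0.95 \\
\alpha = 0.05  &\;\Rightarrow\; 1 - 2(0.05) = 0.90 \\
\alpha = 0.005 &\;\Rightarrow\; 1 - 2(0.005) = 0.99
\end{align*}
```

If only one limit of the interval is read, the same alpha corresponds to a one-sided level of 1 - \alpha, so alpha = 0.05 gives a 1-tailed 95% bound.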
An object of class dstudy, with the following components:
Erho2, Erho2.lwr, Erho2.upr | Expected generalizability coefficient, and lower and upper limits of the intervals around it. |
Phi, Phi.lwr, Phi.upr | Expected index of dependability, and lower and upper limits of the intervals around it. |
n.q_Erho2, n.q_Erho2.lwr, n.q_Erho2.upr | Expected number of queries to achieve the generalizability coefficient, and lower and upper limits of the intervals around it. |
n.q_Phi, n.q_Phi.lwr, n.q_Phi.upr | Expected number of queries to achieve the index of dependability, and lower and upper limits of the intervals around it. |
call | A list with the gstudy used in this D-study, the target number of queries, the target level of stability and the alpha level for the confidence intervals. |
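The required query set sizes can be understood by inverting the textbook definitions of the two coefficients, E\rho^2 = \sigma^2_s / (\sigma^2_s + \sigma^2_e / n'_q) and \Phi = \sigma^2_s / (\sigma^2_s + (\sigma^2_q + \sigma^2_e) / n'_q), for a target stability level \pi. This is a sketch under the assumption that dstudy uses these standard formulas; \pi and n'_q are symbols introduced here for illustration, and \sigma^2_s, \sigma^2_q, \sigma^2_e stand for var.s, var.q and var.e.

```latex
\begin{align*}
E\rho^2 = \pi
  &\;\Longrightarrow\;
  n'_q = \frac{\pi\,\sigma^2_e}{(1 - \pi)\,\sigma^2_s} \\[4pt]
\Phi = \pi
  &\;\Longrightarrow\;
  n'_q = \frac{\pi\,(\sigma^2_q + \sigma^2_e)}{(1 - \pi)\,\sigma^2_s}
\end{align*}
```

The required size grows without bound as \pi approaches 1. In practice one would round up to a whole number of queries; whether dstudy rounds its output is not stated here.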
Julián Urbano
R.L. Brennan (2001). Generalizability Theory. Springer.
L.S. Feldt (1965). The Approximate Sampling Distribution of Kuder-Richardson Reliability Coefficient Twenty. Psychometrika, 30(3):357–370.
C. Arteaga, S. Jeyaratnam, and G. A. Franklin (1982). Confidence Intervals for Proportions of Total Variance in the Two-Way Cross Component of Variance Model. Communications in Statistics: Theory and Methods, 11(15):1643–1658.
J. Urbano, M. Marrero and D. Martín (2013). On the Measurement of Test Collection Reliability. ACM SIGIR, pp. 393-402.
    g <- gstudy(adhoc3)
    dstudy(g)
    # estimate stability at various query set sizes
    dstudy(g, queries = seq(50, 200, 10))
    # estimate required query set sizes for various stability levels
    dstudy(g, stability = seq(0.8, 0.95, 0.01))
    # compute both 95% and 99% confidence intervals
    dstudy(g, stability = 0.9, alpha = c(0.05, 0.01) / 2)
    # compute 1-tailed 95% confidence intervals
    dstudy(g, alpha = 0.05)
gstudy runs a G-study with the given data, assuming a fully crossed design (all systems evaluated on the same queries). It can be used to estimate variance components, which can further be used to run a D-study with dstudy.
gstudy(data, drop = 0)
data | A data frame or matrix with the existing effectiveness scores. Systems are columns and queries are rows. |
drop | The fraction of worst-performing systems to drop from the data before analysis. Defaults to 0 (include all systems). |
An object of class gstudy, with the following components:
n.s, n.q | Number of systems and number of queries in the existing data. |
var.s, var.q, var.e | Variance of the system, query, and residual effects. |
em.s, em.q, em.e | Mean squares of the system, query, and residual components. |
call | A list with the existing data and the fraction of systems to drop. |
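The link between the mean squares and the variance components can be sketched with the usual ANOVA estimators for a fully crossed two-way design with one score per system-query pair (Brennan, 2001). Here MS_s, MS_q and MS_e are assumed to correspond to em.s, em.q and em.e, and \hat\sigma^2_s, \hat\sigma^2_q and \hat\sigma^2_e to var.s, var.q and var.e; that gstudy computes them in exactly this way is an assumption, not something this page states.

```latex
\begin{align*}
\hat\sigma^2_e &= MS_e \\
\hat\sigma^2_s &= \frac{MS_s - MS_e}{n_q} \\
\hat\sigma^2_q &= \frac{MS_q - MS_e}{n_s}
\end{align*}
```

These follow from the expected mean squares E[MS_s] = \sigma^2_e + n_q \sigma^2_s and E[MS_q] = \sigma^2_e + n_s \sigma^2_q. Note that the estimators can come out negative in small samples when a mean square falls below MS_e.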
Julián Urbano
R.L. Brennan (2001). Generalizability Theory. Springer.
J. Urbano, M. Marrero and D. Martín (2013). On the Measurement of Test Collection Reliability. ACM SIGIR, pp. 393-402.
    g <- gstudy(adhoc3)
    # same, but drop the 20% worst systems
    g20 <- gstudy(adhoc3, drop = 0.2)
Maps Erho2 and Phi scores from Generalizability Theory onto traditional data-based indicators: the Kendall tau correlation, AP correlation, power, minor conflict rate and major conflict rate with 2-tailed t-tests, absolute and relative sensitivity, and root mean squared error.
gt2tau(Erho2)
gt2tauAP(Erho2)
gt2power(Erho2)
gt2minor(Erho2)
gt2major(Erho2)
gt2asens(Erho2)
gt2rsens(Phi)
gt2rmse(Phi)
Erho2 | Vector of generalizability coefficients to map from. |
Phi | Vector of indices of dependability to map from. |
Take these mappings with a grain of salt. See Figure 3 in Urbano et al. (2013).
A vector of data-based indicator values.
Julián Urbano
J. Urbano, M. Marrero and D. Martín (2013). On the Measurement of Test Collection Reliability. ACM SIGIR, pp. 393-402.
    g <- gstudy(adhoc3)
    d <- dstudy(g)
    gt2tau(d$Erho2)
    gt2rmse(d$Phi)
This is synthetic dataset no. 4 from Table 3.2 on page 73 of Brennan (2001), recast as a p × i design, as required on page 182.
synthetic4
A data frame with 10 columns (systems) and 12 rows (queries).
R.L. Brennan (2001). Generalizability Theory. Springer.