tarexp.evaluation module#
Consistent implementation of effectiveness metrics, including tricky issues such as tie-breaking, is critical to TAR experiments. This is true both for evaluation itself and because stopping rules may incorporate effectiveness estimates based on small samples.
We provide all metrics from the open-source package ir-measures through the tarexp.workflow.Workflow.getMetrics() method. Metrics are computed on both the full collection and the unreviewed documents, supporting both the finite-population and the generalization perspectives.
In addition to standard IR metrics, TARexp implements OptimisticCost (tarexp.evaluation.OptimisticCost) to support the idealized end-to-end cost analysis for TAR proposed in Yang et al. [1]. Such an analysis requires specifying a target recall and a cost structure associated with the TAR process. TARexp also provides helper functions for plotting cost dynamics graphs (tarexp.helper.plotting).
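For illustration, the sketch below builds a mixed list of measure specifications (an ir-measures measure object, a measure-name string, and an OptimisticCost key). The workflow call is commented out because the surrounding workflow setup, and the exact getMetrics() signature, are assumptions rather than part of this page.

```python
import ir_measures
from tarexp.evaluation import OptimisticCost

# Measures can be specified as ir-measures objects, measure-name strings,
# or TARexp-specific keys such as OptimisticCost.
measures = [
    ir_measures.P@10,                  # ir-measures measure object
    "RPrec",                           # measure name understood by ir-measures
    OptimisticCost(target_recall=0.8,  # idealized two-phase review cost
                   cost_structure=(1, 1, 1, 1)),
]

# Assumption: `workflow` is an already-constructed tarexp.workflow.Workflow and
# getMetrics() accepts the same measure specifications as evaluate() below.
# metrics = workflow.getMetrics(measures)
```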
- class tarexp.evaluation.MeasureKey(measure: str | ir_measures.measures.Measure = None, section: str = 'all', target_recall: float = None)[source]#
Bases: object
Hashable key for an evaluation metric.
- measure: str | ir_measures.measures.Measure = None#
Name of the measurement.
- section: str = 'all'#
Part of the collection that the evaluation is measured on. Possible values include, but are not limited to, “all” and “known” (reviewed documents).
- target_recall: float = None#
The recall target. Can be None depending on whether the measure requires one.
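Since MeasureKey is hashable, evaluation results can be stored in an ordinary dictionary keyed by it. A minimal sketch; the measurement values are made up:

```python
import ir_measures
from tarexp.evaluation import MeasureKey

# Keys identify what was measured, on which part of the collection,
# and (optionally) at which recall target.
key_all = MeasureKey(measure=ir_measures.P@10, section="all")
key_known = MeasureKey(measure="P@10", section="known")

results = {key_all: 0.7, key_known: 0.9}   # illustrative values only
print(results[key_known])
```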
- class tarexp.evaluation.CostCountKey(measure: str | ir_measures.measures.Measure = None, section: str = 'all', target_recall: float = None, label: bool = None)[source]#
Bases: MeasureKey
Hashable key for recorded document counts.
- measure#
Name of the measurement.
- section#
Part of the collection that the evaluation is measured on. Possible values include, but are not limited to, “all” and “known” (reviewed documents).
- target_recall#
The recall target. Can be None depending on whether the measure requires one.
- label: bool = None#
The ground-truth label of the documents being counted.
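For example, a key counting reviewed relevant documents at a 0.8 recall target could look like the sketch below; the exact keys produced in practice are determined by OptimisticCost.calc_all().

```python
from tarexp.evaluation import CostCountKey

# Count of reviewed ("known") documents whose ground-truth label is positive,
# reported alongside an evaluation at a 0.8 recall target.
count_key = CostCountKey(section="known", target_recall=0.8, label=True)
```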
- class tarexp.evaluation.OptimisticCost(measure: str | ir_measures.measures.Measure = None, section: str = 'all', target_recall: float = None, cost_structure: tuple[float, float, float, float] = None)[source]#
Bases: MeasureKey
Optimistic cost measure that records the total cost of reviewing documents in both the first phase and an (optimal) second phase of the review workflow. Please refer to Yang et al. [1] for further details.
- measure#
Name of the measurement.
- section#
Part of the collection that the evaluation is measured on. Possible values include, but are not limited to, “all” and “known” (reviewed documents).
- target_recall#
The recall target. Can be None depending on whether the measure requires one.
- cost_structure: tuple[float, float, float, float] = None#
Four-tuple cost structure. The elements are the unit costs of reviewing a positive and a negative document in the first phase, followed by a positive and a negative document in the second phase, respectively.
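As a sketch, the cost structure below charges one unit for every first-phase review and five units for every second-phase review; the values are illustrative, not recommendations from Yang et al. [1].

```python
from tarexp.evaluation import OptimisticCost

# (phase-1 positive, phase-1 negative, phase-2 positive, phase-2 negative)
expensive_second_phase = (1, 1, 5, 5)

cost_at_80_recall = OptimisticCost(target_recall=0.8,
                                   cost_structure=expensive_second_phase)
```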
- static calc_all(measures, df)[source]#
Static method for calculating multiple OptimisticCost measures given a cost Pandas DataFrame. The DataFrame should contain the columns “query_id”, “iteration”, “relevance” (the ground truth), “score”, “control” (a boolean indicating whether the document is in the control set), and “known” (whether the document has been reviewed), with one row per document in the collection. This DataFrame is similar to the one used in ir_measures.calc_all(). This method is also used internally by evaluate().
- Parameters:
  - measures – A list of OptimisticCost instances to calculate.
  - df – The cost Pandas DataFrame.
- Returns:
  A Python dictionary with the measures as keys and the measurement values as values. Counts for each section at all recall targets provided in the measures argument are also returned as auxiliary information.
- Return type:
  dict
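A sketch of assembling the cost DataFrame described above and invoking calc_all(). All values are fabricated for illustration, and the encoding of the “iteration” column for unreviewed documents is an assumption.

```python
import pandas as pd
from tarexp.evaluation import OptimisticCost

# One row per document in the collection, with the columns calc_all() expects.
df = pd.DataFrame({
    "query_id":  ["topic1"] * 4,
    "iteration": [0, 0, 1, -1],          # review round; -1 used here for "not reviewed"
    "relevance": [1, 0, 1, 0],           # ground-truth labels
    "score":     [0.9, 0.2, 0.8, 0.4],   # current classifier scores
    "control":   [False, True, False, False],
    "known":     [True, True, True, False],
})

measures = [OptimisticCost(target_recall=0.8, cost_structure=(1, 1, 5, 5))]
results = OptimisticCost.calc_all(measures, df)  # dict mapping measures to values
```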
- tarexp.evaluation.evaluate(labels, ledger: Ledger, score, measures) → Dict[MeasureKey, int | float][source]#
Evaluate a TAR run based on a given tarexp.ledger.Ledger. This function calculates the evaluation metrics based on the provided ledger, evaluating the measures on the last round recorded in it. To calculate metrics for a past round, provide a ledger that only contains information up to that round by using tarexp.ledger.Ledger.freeze_at(). It serves as a catch-all function for all evaluation metrics TARexp supports, including all measurements in ir-measures as well as OptimisticCost. Newly supported evaluation metrics should also be added to this function for completeness.
- Parameters:
  - labels – The ground-truth labels of the documents. These differ from the labels recorded in the Ledger, which are the review results (not necessarily the ground truth unless a tarexp.components.labeler.PerfectLabeler is used).
  - ledger – The ledger that recorded the review progress.
  - score – A list of the document scores.
  - measures – A list of MeasureKey instances, names of measurements supported in ir-measures, or ir-measures measurement objects (such as ir_measures.P@10).
- Returns:
  A Python dictionary with MeasureKey instances as keys and the corresponding measurement values as values.
- Return type:
  dict[MeasureKey, int | float]
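A minimal usage sketch. Here labels, scores, and ledger are assumed to come from a completed TARexp workflow run, and freeze_at() is assumed to take a round number; neither assumption is spelled out on this page.

```python
import ir_measures
from tarexp.evaluation import OptimisticCost, evaluate

measures = [
    ir_measures.P@10,
    OptimisticCost(target_recall=0.8, cost_structure=(1, 1, 5, 5)),
]

# Evaluate the latest round recorded in the ledger.
results = evaluate(labels, ledger, scores, measures)

# Evaluate an earlier round by freezing the ledger at that round first.
past_results = evaluate(labels, ledger.freeze_at(5), scores, measures)

# Inspect the results via the MeasureKey attributes.
for key, value in results.items():
    print(key.measure, key.section, key.target_recall, value)
```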