tarexp.evaluation module#

Consistent implementation of effectiveness metrics, including tricky issues like tie-breaking, is critical to TAR experiments. This is true both for evaluation and because stopping rules may incorporate effectiveness estimates based on small samples. We provide all metrics from the open-source package ir-measures through the tarexp.workflow.Workflow.getMetrics() method. Metrics are computed on both the full collection and the unreviewed documents to support both finite-population and generalization perspectives.

In addition to standard IR metrics, TARexp implements OptimisticCost (tarexp.evaluation.OptimisticCost) to support the idealized end-to-end cost analysis for TAR proposed in Yang et al. [1]. Such analysis requires specifying a target recall and a cost structure associated with the TAR process. TARexp also provides helper functions for plotting cost dynamics graphs (tarexp.helper.plotting).
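
For illustration, the following is a minimal sketch of the kinds of measure specifications accepted by tarexp.evaluation.evaluate() (and, by extension, tarexp.workflow.Workflow.getMetrics()); the specific metrics and the uniform cost structure are illustrative choices, not defaults:

>>> import ir_measures
>>> from tarexp.evaluation import OptimisticCost
>>> measures = [
...     ir_measures.P@10,                  # an ir-measures measurement object
...     "RPrec",                           # a measurement referred to by name
...     OptimisticCost(target_recall=0.8,  # TARexp's end-to-end cost measure
...                    cost_structure=(1, 1, 1, 1)),
... ]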

class tarexp.evaluation.MeasureKey(measure: str | ir_measures.measures.Measure = None, section: str = 'all', target_recall: float = None)[source]#

Bases: object

Hashable key for evaluation metric.

measure: str | ir_measures.measures.Measure = None#

Name of the measurement.

section: str = 'all'#

Part of the collection that the evaluation is measured on. Possible values include, but are not limited to, “all” and “known” (reviewed documents).

target_recall: float = None#

The recall target. Can be None depending on whether the measure requires one.
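
For example, a key identifying R-precision measured over the reviewed documents could be constructed as in the sketch below (the measure name is only an illustration; any measurement supported by ir-measures can be used):

>>> from tarexp.evaluation import MeasureKey
>>> key = MeasureKey(measure="RPrec", section="known")

Results returned by tarexp.evaluation.evaluate() are dictionaries indexed by such keys, so individual measurements can be looked up directly.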

class tarexp.evaluation.CostCountKey(measure: str | ir_measures.measures.Measure = None, section: str = 'all', target_recall: float = None, label: bool = None)[source]#

Bases: MeasureKey

Hashable key for recording counts of documents.

measure#

Name of the measurement.

section#

Part of the collection that the evaluation is measured on. Possible values include, but are not limited to, “all” and “known” (reviewed documents).

target_recall#

The recall target. Can be None depending on whether the measure requires one.

label: bool = None#

The ground truth label that the measure is counting.
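
For example, a key recording the number of known positive documents under a recall target of 0.8 could be constructed as in the sketch below (the values are illustrative):

>>> from tarexp.evaluation import CostCountKey
>>> pos_known = CostCountKey(section="known", target_recall=0.8, label=True)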

class tarexp.evaluation.OptimisticCost(measure: str | ir_measures.measures.Measure = None, section: str = 'all', target_recall: float = None, cost_structure: tuple[float, float, float, float] = None)[source]#

Bases: MeasureKey

Optimistic Cost

The cost measure that records the total cost of reviewing documents in both the first phase and an (optimal) second phase of the review workflow. Please refer to Yang et al. [1] for further details.

measure#

Name of the measurement.

section#

Part of the collection that the evaluation is measured on. Possible values include, but are not limited to, “all” and “known” (reviewed documents).

target_recall#

The recall target. Can be None depending on whether the measure requires one.

cost_structure: tuple[float, float, float, float] = None#

Four-tuple cost structure. The elements are the unit costs of reviewing positive and negative documents in the first phase, followed by positive and negative documents in the second phase, respectively.
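
For example, a cost structure in which first-phase review is ten times as expensive as second-phase review, regardless of label, could be specified as in the sketch below; the target recall and unit costs are illustrative only:

>>> from tarexp.evaluation import OptimisticCost
>>> cost80 = OptimisticCost(target_recall=0.8,
...                         cost_structure=(10, 10, 1, 1))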

static calc_all(measures, df)[source]#

Static method for calculating multiple OptimisticCost measures given a cost Pandas DataFrame.

The dataframe should contain “query_id”, “iteration”, “relevance” (the ground-truth label), “score”, “control” (a boolean indicating whether the document is in the control set), and “known” (whether the document has been reviewed) as columns, with one row per document in the collection. This dataframe is similar to the one used in ir_measures.calc_all().

This method is also used internally by evaluate().

Parameters:
  • measures – A list of OptimisticCost instances to calculate.

  • df – The cost Pandas DataFrame

Returns:

A Python dictionary with the measures as keys and the measurement values as values. Counts for each section under all recall targets provided in the measures argument are also returned as auxiliary information.

Return type:

dict
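
The sketch below illustrates the expected shape of the cost DataFrame on a toy four-document collection; the values are fabricated for illustration, and in a real run “score”, “control”, and “known” would come from the workflow and ledger:

>>> import pandas as pd
>>> from tarexp.evaluation import OptimisticCost
>>> df = pd.DataFrame({
...     "query_id": ["0"] * 4,
...     "iteration": [0, 0, 1, 1],
...     "relevance": [1, 0, 1, 0],        # ground-truth labels
...     "score": [0.9, 0.2, 0.7, 0.4],    # current model scores
...     "control": [False, True, False, False],
...     "known": [True, True, False, False],
... })
>>> measures = [OptimisticCost(target_recall=0.8, cost_structure=(1, 1, 1, 1))]
>>> results = OptimisticCost.calc_all(measures, df)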

tarexp.evaluation.evaluate(labels, ledger: Ledger, score, measures) Dict[MeasureKey, int | float][source]#

Evaluate a TAR run based on a given tarexp.ledger.Ledger.

This function calculates the evaluation metrics based on the provided Ledger. The measures are evaluated on the last round recorded in the ledger. To calculate metrics on past rounds, provide a ledger that only contains information up to the round of interest by using tarexp.ledger.Ledger.freeze_at().

It serves as a catch-all function for all evaluation metrics TARexp supports, including all measurements in ir-measures and OptimisticCost. Future additions to the supported evaluation metrics should also be added to this function for completeness.

Parameters:
  • labels – The ground-truth labels of the documents. These differ from the labels recorded in the Ledger, which are the review results (not necessarily the ground truth unless a tarexp.components.labeler.PerfectLabeler is used).

  • ledger – The ledger that recorded the progress of the run.

  • score – A list of the document scores.

  • measures – A list of MeasureKey instances, names of measurements supported in ir-measures, or ir-measures measurement objects (such as ir_measures.P@10).

Returns:

A Python dictionary with MeasureKey instances as keys and the corresponding measurement values as values.

Return type:

dict[MeasureKey, int|float]
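
For example, to evaluate a run as of an earlier round, the ledger can be frozen first, as in the sketch below; it assumes that labels, ledger, and scores already exist from a completed simulation and are not created here, and the round number, target recall, and cost structure are illustrative:

>>> import ir_measures
>>> from tarexp.evaluation import OptimisticCost, evaluate
>>> past_ledger = ledger.freeze_at(5)   # ledger state as of round 5
>>> results = evaluate(labels, past_ledger, scores,
...                    measures=[ir_measures.RPrec,
...                              OptimisticCost(target_recall=0.8,
...                                             cost_structure=(1, 1, 1, 1))])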