tarexp.ledger module#
Any aspect of the history of a batch-based workflow can, if necessary, be reproduced from a record of
which documents were labeled on which training rounds (including any initial seed round).
The tarexp.ledger.Ledger instance records this state in memory and writes it to disk at user-specified intervals (specified in tarexp.workflow.Workflow) to enable restarts.
The persisted ledger for a complete run can be used to execute TARexp in frozen mode (tarexp.ledger.FrozenLedger), where no batch selection, training, or scoring is done. Frozen mode supports efficient testing of new components that
do not change training or scoring, e.g., non-interventional stopping rules [1] and effectiveness estimation methods.
Evaluating stopping rules for two-phase reviews also requires persisting scores of all documents at the end of each
training round, an option the user can specify.
- class tarexp.ledger.Ledger(n_docs: int)[source]#
Bases:
Savable
A Ledger records the progress of a TAR run. Specifically, it records (1) the review result of each document and (2) the round in which each document was reviewed. Control set documents are marked as reviewed at round -1 and seed documents at round 0.
- Parameters:
n_docs – Number of documents in the collection.
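The bookkeeping described above can be sketched in plain Python. This is a minimal illustration of the idea only: `ToyLedger`, `record`, and `reviewed_ids` are hypothetical names, and the actual tarexp implementation (which is backed by a numpy array) may differ in every detail.

```python
from typing import List, Optional

CONTROL_ROUND = -1  # control-set documents, as stated in the class description
SEED_ROUND = 0      # seed documents, as stated in the class description


class ToyLedger:
    """Hypothetical stand-in for illustration; not the tarexp implementation."""

    def __init__(self, n_docs: int) -> None:
        # One slot per document; None means "not reviewed yet".
        self.labels: List[Optional[bool]] = [None] * n_docs  # review result
        self.rounds: List[Optional[int]] = [None] * n_docs   # round of review

    def record(self, round_no: int, doc_ids: List[int], labels: List[bool]) -> None:
        """Store the label and the round number for each reviewed document."""
        for doc_id, label in zip(doc_ids, labels):
            self.labels[doc_id] = label
            self.rounds[doc_id] = round_no

    def reviewed_ids(self, round_no: int) -> List[int]:
        """Return the ids of the documents reviewed in the given round."""
        return [i for i, r in enumerate(self.rounds) if r == round_no]


ledger = ToyLedger(n_docs=6)
ledger.record(CONTROL_ROUND, [5], [False])  # control set
ledger.record(SEED_ROUND, [0], [True])      # seed round
ledger.record(1, [2, 3], [False, True])     # first training round
```

Storing the round next to the label is what allows the state at any earlier round to be reconstructed later.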
- createControl(*args)[source]#
Create a control set. Please refer to
annotate()
for the argument documentation.
- annotate(*args)[source]#
Record a round of annotation (review).
- Parameters:
Dictionary – If only one positional argument is provided, it should be a dictionary with document ids as keys and the corresponding labels as values.
Pair – If two positional arguments are provided, the first is treated as the documents and the second as the corresponding labels. The two lists must have the same length.
- Returns:
Number of documents annotated in this round.
- Return type:
int
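The two calling conventions described above (a single dictionary, or a pair of equal-length lists) can be normalized into one mapping. The helper below is a hypothetical sketch of such argument handling, written in plain Python; it is not taken from the tarexp source, and the exception types are an assumption.

```python
from typing import Dict, Hashable


def normalize_annotations(*args) -> Dict[Hashable, bool]:
    """Hypothetical helper: turn either calling form into {doc_id: label}."""
    if len(args) == 1:
        # Dictionary form: keys are document ids, values are labels.
        return dict(args[0])
    if len(args) == 2:
        # Pair form: documents first, labels second.
        docs, labels = list(args[0]), list(args[1])
        if len(docs) != len(labels):
            raise ValueError("documents and labels must have the same length")
        return dict(zip(docs, labels))
    raise TypeError("expected one dictionary or two sequences")


as_dict = normalize_annotations({3: True, 7: False})
as_pair = normalize_annotations([3, 7], [True, False])
```

Both calls produce the same mapping, so the rest of the recording logic only has to handle one shape of input.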
- getReviewedIds(round: int) → np.ndarray[int] [source]#
Get the ids of the documents that were reviewed at
round
.
- property control_mask: np.ndarray[bool | np.nan]#
A mask over the whole collection that marks the control documents.
- property n_rounds: int#
Number of rounds that have been executed.
- property n_docs: int#
Number of documents in the collection.
- property n_annotated: int#
Total number of documents that have been reviewed.
- property n_pos_annotated: int#
Total number of documents that have been labeled as positive (relevant).
- property n_neg_annotated: int#
Total number of documents that have been labeled as negative (non-relevant).
- property annotated: np.ndarray[bool]#
A mask over the whole collection that marks the annotated (reviewed) documents.
- property annotation: np.ndarray[bool | np.nan]#
List of the annotations. Documents that have not been reviewed are recorded as np.nan. Control documents are considered not annotated.
- property isDone: bool#
Whether all documents have been reviewed (including control documents).
- getAnnotationCounts() → List[Dict[bool, int]] [source]#
Get a list of dictionaries that records the number of positive and negative documents reviewed in each round.
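To make the return shape concrete, the sketch below computes a list of per-round `{True: n_pos, False: n_neg}` dictionaries from two parallel lists. The function name and the input layout are hypothetical, and excluding control documents (round -1) from the counts is an assumption of this sketch rather than documented behavior.

```python
from collections import Counter
from typing import Dict, List, Optional


def annotation_counts(
    labels: List[Optional[bool]], rounds: List[Optional[int]]
) -> List[Dict[bool, int]]:
    """Hypothetical sketch: count positive and negative labels per round."""
    # Rounds are numbered from 0 (seed round); None means "not reviewed".
    n_rounds = max((r for r in rounds if r is not None), default=-1) + 1
    counters = [Counter() for _ in range(n_rounds)]
    for label, r in zip(labels, rounds):
        if r is not None and r >= 0:  # assumption: skip control docs (round -1)
            counters[r][bool(label)] += 1
    return [{True: c[True], False: c[False]} for c in counters]


labels = [True, None, False, True, None, False]
rounds = [0, None, 1, 1, None, -1]
counts = annotation_counts(labels, rounds)
```

Here `counts[0]` describes the seed round and `counts[1]` the first training round.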
- freeze() → FrozenLedger [source]#
Get a frozen version of the current ledger.
- freeze_at(round: int) → FrozenLedger [source]#
Get a frozen version of the ledger at
round
.
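One way to picture a ledger frozen at a given round is as an immutable copy in which later annotations are blanked out. The sketch below assumes that the requested round is included in the snapshot; the documentation does not state this, and `snapshot_at` is a hypothetical helper rather than part of tarexp.

```python
from typing import List, Optional, Tuple

Labels = Tuple[Optional[bool], ...]
Rounds = Tuple[Optional[int], ...]


def snapshot_at(
    labels: List[Optional[bool]], rounds: List[Optional[int]], round_no: int
) -> Tuple[Labels, Rounds]:
    """Hypothetical sketch: keep annotations made at or before round_no."""
    kept_labels: List[Optional[bool]] = []
    kept_rounds: List[Optional[int]] = []
    for label, r in zip(labels, rounds):
        if r is not None and r <= round_no:  # assumption: inclusive cut-off
            kept_labels.append(label)
            kept_rounds.append(r)
        else:
            # Reviewed later (or never): appears unreviewed in the snapshot.
            kept_labels.append(None)
            kept_rounds.append(None)
    # Tuples cannot be modified in place, which mimics a read-only record.
    return tuple(kept_labels), tuple(kept_rounds)


labels = [True, False, True, None]
rounds = [0, 1, 2, None]
snap_labels, snap_rounds = snapshot_at(labels, rounds, round_no=1)
```

The original lists are left untouched, so the live record can keep growing while the snapshot stays fixed.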
- class tarexp.ledger.FrozenLedger(org_ledger: Ledger)[source]#
Bases:
Ledger
A frozen ledger prohibits modification of the record. The underlying record (implemented as a numpy array) is locked as not writable.
All methods and properties of
Ledger
are supported except tarexp.ledger.Ledger.annotate(), which raises a FrozenInstanceError when invoked.
- property n_rounds#
Number of rounds this frozen ledger has recorded.
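The frozen behavior can be illustrated with a small, self-contained sketch. The classes below are hypothetical stand-ins that use only the standard library: the read-only record is imitated with a tuple of read-only mappings instead of a non-writable numpy array, and the use of `dataclasses.FrozenInstanceError` is an assumption about which exception class is meant.

```python
from dataclasses import FrozenInstanceError
from types import MappingProxyType
from typing import Dict


class ToyLedger:
    """Hypothetical mutable ledger: one {doc_id: label} mapping per round."""

    def __init__(self) -> None:
        self._record = []

    def annotate(self, labels: Dict[int, bool]) -> int:
        self._record.append(dict(labels))
        return len(labels)

    @property
    def n_rounds(self) -> int:
        return len(self._record)

    def freeze(self) -> "ToyFrozenLedger":
        return ToyFrozenLedger(self)


class ToyFrozenLedger(ToyLedger):
    """Hypothetical read-only view of the ledger at the time of freezing."""

    def __init__(self, org_ledger: ToyLedger) -> None:
        # Copy each round into a read-only mapping and store them in a tuple.
        self._record = tuple(
            MappingProxyType(dict(r)) for r in org_ledger._record
        )

    def annotate(self, labels: Dict[int, bool]) -> int:
        raise FrozenInstanceError("cannot annotate a frozen ledger")


live = ToyLedger()
live.annotate({0: True})
frozen = live.freeze()
live.annotate({1: False})  # the live ledger moves on; the frozen copy does not

try:
    frozen.annotate({2: True})
    blocked = False
except FrozenInstanceError:
    blocked = True
```

Read-only properties such as `n_rounds` keep working on the frozen copy, while any attempt to add annotations is rejected.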