tarexp.dataset module#

A dataset contains the essential information of the document collection used for retrieval. It is designed to remain static throughout the TAR run.

Documents must be encoded as vectors, but the vectors are not limited to any particular form. Depending on the intended experiments, they can be generated by the scikit-learn TF-IDF vectorizer or even Hugging Face Transformers tokenizers. We leave this flexibility to the user for further extension.

Ground truth labels are essential when running experiments without human intervention, and they are also used by most evaluation measures. If the workflow is designed to run with actual human review, the labels are no longer required.
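For example, a small collection can be wrapped with the SparseVectorDataset class documented below. This is a minimal sketch, assuming that from_text() forwards extra keyword arguments to the constructor:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from tarexp.dataset import SparseVectorDataset

    texts = ["a relevant document", "an irrelevant document"]  # toy collection
    labels = [True, False]                                     # toy ground truth

    # Vectorize the raw text with scikit-learn's TF-IDF vectorizer.
    ds = SparseVectorDataset.from_text(texts, vectorizer=TfidfVectorizer())
    ds = ds.setLabels(labels)  # returns a new, labeled dataset
    print(ds.n_docs, ds.hasLabels)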

class tarexp.dataset.Dataset(name=None)[source]#

Bases: object

Meta class of TARexp dataset.

The class defines the basic features of a dataset. All downstream dataset classes that inherit from this class should implement the following properties and methods (a minimal subclass sketch follows the list):

  • Essentials

    • identifier

The unique identifier of the dataset. It is used to verify that the dataset provided is identical when resuming a workflow. The identifier should summarize both the vectors and the labels with a hash that depends on the actual content rather than the memory location of the variable (as the built-in hash() function does). The utility function tarexp.util.stable_hash() provides this capability.

    • ingest()

Ingests a list of raw texts into vectors and stores them in the attribute _vectors.

    • getAllData()

(Optional) Returns all vectors of the documents in the dataset. Ideally, it should return a copy of the vectors, but it could also return a reference if the collection is too large to copy in memory. This meta class provides a simple implementation, but users implementing a new dataset class should consider re-implementing it to support on-demand processing of the vectors (such as collators in PyTorch).

    • getTrainingData()

(Optional) Takes a tarexp.ledger.Ledger as an argument and returns the vectors and labels of the documents reviewed in the ledger. This meta class also provides a simple implementation, but users should consider re-implementing it for the same reason as getAllData().

    • duplicate()

Returns a copy of the dataset along with any information that should be copied. This method should perform a deep copy of all contained objects to avoid shared memory references that prevent fast multi-processing.

  • Labels (Optional)

    • labels

The labels of the dataset. We recommend implementing this information as a property instead of an attribute of the class to prevent accidentally modifying the labels during the workflow. If the labels are intended to be unavailable, consider raising NotImplemented instead of NotImplementedError to reflect the intention.

    • pos_doc_ids and neg_doc_ids

The sets of positive and negative document ids.

    • setLabels()

A method that returns a new dataset containing the labels of all documents in the dataset. It should also check that every document is assigned a label. Spawning a new instance ensures that the original dataset instance is not polluted.
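The sketch below illustrates these requirements with a hypothetical dataset backed by a dense NumPy matrix. The attribute names _vectors and _labels and the argument passed to stable_hash() are assumptions for illustration, not the library's required layout:

    import copy
    import numpy as np
    from tarexp.dataset import Dataset
    from tarexp.util import stable_hash

    class DenseVectorDataset(Dataset):
        """Hypothetical dataset backed by a dense NumPy matrix."""

        @property
        def identifier(self):
            # Hash the actual content (not the memory location) so a
            # resumed workflow can verify it received the same dataset.
            # Assumes labels have already been set on this instance.
            return stable_hash((self._vectors.tobytes(), tuple(self._labels)))

        def ingest(self, text, force=False):
            # Toy vectorization: a single feature, the document length.
            self._vectors = np.array([[len(t)] for t in text], dtype=float)

        def duplicate(self):
            # Deep-copy contained objects so the copy shares no memory
            # references with the original.
            new = DenseVectorDataset(name=self.name)
            new._vectors = copy.deepcopy(self._vectors)
            new._labels = copy.deepcopy(getattr(self, "_labels", None))
            return new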

property identifier#
property name#
property n_docs#

Number of documents in the dataset.

property labels#
property hasLabels#
property pos_doc_ids: set#
property neg_doc_ids: set#
ingest(text, force=False)[source]#
setLabels(labels, inplace=False)[source]#
getAllData(copy=False)[source]#
getTrainingData(ledger: Ledger)[source]#
duplicate()[source]#
classmethod from_text(text, **kwargs)[source]#

Class factory method that returns an instance of the class with ingested text.

class tarexp.dataset.SparseVectorDataset(vectorizer=None)[source]#

Bases: Dataset

Dataset backed by a SciPy sparse matrix.

Parameters:

vectorizer – A function or a class instance that has a fit_transform method (such as the vectorizers from scikit-learn).
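Because only a fit_transform method is required, any duck-typed vectorizer works, not just the scikit-learn ones. A sketch with a hypothetical hashing vectorizer:

    from scipy.sparse import csr_matrix
    from tarexp.dataset import SparseVectorDataset

    class HashingUnigramVectorizer:
        """Hypothetical vectorizer: only fit_transform is required."""
        def __init__(self, n_features=1024):
            self.n_features = n_features

        def fit_transform(self, texts):
            rows, cols, vals = [], [], []
            for i, text in enumerate(texts):
                for token in text.split():
                    rows.append(i)
                    cols.append(hash(token) % self.n_features)
                    vals.append(1.0)
            # Duplicate (row, col) entries are summed by csr_matrix.
            return csr_matrix((vals, (rows, cols)),
                              shape=(len(texts), self.n_features))

    ds = SparseVectorDataset(vectorizer=HashingUnigramVectorizer())
    ds.ingest(["tar experiment", "another document"])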

property n_docs#

Number of documents in the dataset.

property labels#

Returns a copy of the labels of all documents.

property identifier#
property pos_doc_ids: set#

Returns the ids of the positive documents.

property neg_doc_ids: set#

Returns the ids of the negative documents.

ingest(text, force=False)[source]#

Ingest the text using the vectorizer and store the vectors in this instance.

Parameters:
  • text – A list of texts that will be ingested and stored. If the labels are set, the length of the list should be identical to the number of labels.

  • force – Whether to skip the check that the number of texts matches the number of labels.

setLabels(labels, inplace=False)[source]#

Returns a new dataset with the given labels.

Parameters:
  • labels – A list or NumPy array of binary labels. The length should match the number of documents in the dataset.

  • inplace – Whether to apply this set of labels to the current dataset. If True, the method replaces the labels in place and returns None. Default False.
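A sketch of the two calling conventions, using a small dataset built with the scikit-learn TF-IDF vectorizer:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from tarexp.dataset import SparseVectorDataset

    ds = SparseVectorDataset(vectorizer=TfidfVectorizer())
    ds.ingest(["first document", "second document"])

    labeled = ds.setLabels([True, False])   # new dataset; ds itself unchanged
    ret = ds.setLabels([True, False], inplace=True)  # replaces labels on ds
    assert ret is None                      # inplace=True returns None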

duplicate(deep=False)[source]#

Duplicate the dataset.

Parameters:

deep – Whether to perform deep copy on the vectors. Default False.

classmethod from_sparse(matrix)[source]#

Create a SparseVectorDataset instance from a sparse matrix.
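This is useful when the vectors are pre-computed outside TARexp, for example:

    import scipy.sparse as sp
    from tarexp.dataset import SparseVectorDataset

    mat = sp.random(100, 5000, density=0.01, format="csr")  # pre-computed vectors
    ds = SparseVectorDataset.from_sparse(mat)
    assert ds.n_docs == 100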

class tarexp.dataset.TaskFeeder(dataset: Dataset, labels: Any)[source]#

Bases: object

Python iterator that yields review tasks with different sets of labels over the same base dataset (a dataset without labels).

This class supports both iteration (in a for loop or via the next() function) and index lookup with [] if the provided labels have already been materialized (i.e., are not an iterator).

Parameters:
  • dataset – A Dataset instance that does not contain any labels. This instance spawns the downstream tasks with different labels.

  • labels

    If a Python dictionary is provided, the keys are treated as the task names and the values as the corresponding labels. The length of each set of labels should be the same as the number of documents in the base dataset.

    If a Pandas DataFrame is provided, the columns are treated as the tasks, with the column names used as the task names. The number of rows in the DataFrame should be the same as the number of documents in the base dataset.

    If an iterator is provided, the length of the labels is not checked against the number of documents in the base dataset and a warning is raised. The iterator should yield tuples of the task name and the corresponding labels. The order should also be stable, especially when running experiments across multiple machines. If the iterator supports length via __len__, the class will respect it.
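A sketch of the DataFrame input style, assuming each yielded task is a labeled copy of the base dataset as described above; the task names and random labels are fabricated for illustration:

    import numpy as np
    import pandas as pd
    import scipy.sparse as sp
    from tarexp.dataset import SparseVectorDataset, TaskFeeder

    base = SparseVectorDataset.from_sparse(
        sp.random(100, 500, density=0.01, format="csr"))  # unlabeled base

    # One column per task, one row per document.
    label_df = pd.DataFrame({
        "topic-101": np.random.rand(base.n_docs) > 0.9,
        "topic-102": np.random.rand(base.n_docs) > 0.95,
    })

    feeder = TaskFeeder(base, labels=label_df)
    for task in feeder:            # each task is a labeled copy of base
        print(task.n_docs)

    task = feeder["topic-101"]     # index lookup on materialized labels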

classmethod from_irds(**kwargs)[source]#

Danger

Not implemented yet. Coming soon.