Welcome to Lingualytics’s documentation!

Stopwords

Preprocessing

lingualytics.preprocessing.remove_lessthan(s: pandas.core.series.Series, length: int) → pandas.core.series.Series

Removes words shorter than a specified minimum length.

Parameters
  • s (pd.Series) – A pandas series.

  • length (int) – The minimum length a word should have.
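
A minimal usage sketch (the example sentence and the length threshold are illustrative):

    import pandas as pd
    from lingualytics.preprocessing import remove_lessthan

    # A toy series of code-mixed text; real data would come from your own corpus.
    s = pd.Series(["yeh movie is so accha yaar"])

    # Keep only words with at least 4 characters; shorter words like "is" and "so" are dropped.
    print(remove_lessthan(s, length=4))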

Removes links from the text.

Parameters
  • s (pd.Series) – A pandas series.
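
The entry above does not show the function's own name; assuming it is exposed as lingualytics.preprocessing.remove_links in your installed version, usage would look like this sketch:

    import pandas as pd
    from lingualytics.preprocessing import remove_links  # assumed name; verify against your version

    s = pd.Series(["check this out https://example.com bahut sahi hai"])

    # Strip URLs from every entry in the series.
    print(remove_links(s))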

lingualytics.preprocessing.remove_punctuation(s: pandas.core.series.Series, punctuation: str = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') → pandas.core.series.Series

Removes punctuation from the text.

Parameters
  • s (pd.Series) – A pandas series.

  • punctuation (str) – All the punctuation characters you want to remove.
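
A short sketch of remove_punctuation; the default set matches Python's string.punctuation, and a custom subset can be passed instead:

    import pandas as pd
    from lingualytics.preprocessing import remove_punctuation

    s = pd.Series(["kya baat hai!! #awesome, right?"])

    # Remove the full default punctuation set.
    print(remove_punctuation(s))

    # Remove only a chosen subset of characters.
    print(remove_punctuation(s, punctuation="!?#"))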

lingualytics.preprocessing.remove_stopwords(s: pandas.core.series.Series, stopwords: list)

Removes stopwords from the text.

Parameters
  • s (pd.Series) – A pandas series.

  • stopwords (list of str) – A list of stopwords you want to remove.
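
Because each helper takes a pandas Series and returns one, they compose naturally with Series.pipe. A sketch of a small cleaning pipeline, assuming the English and Hindi stopword lists are shipped as lingualytics.stopwords.en_stopwords and hi_stopwords:

    import pandas as pd
    from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
    from lingualytics.stopwords import en_stopwords, hi_stopwords  # assumed module contents

    df = pd.DataFrame({"text": ["yeh movie bahut achhi thi!!", "the plot was thoda slow :("]})

    # Chain the cleaners with pandas.Series.pipe.
    df["clean_text"] = (
        df["text"]
        .pipe(remove_punctuation)
        .pipe(remove_lessthan, length=3)
        .pipe(remove_stopwords, stopwords=list(en_stopwords) + list(hi_stopwords))
    )
    print(df["clean_text"])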

Representation

lingualytics.representation.get_ngrams(s: pandas.core.series.Series, n: int, delimiter: str = ' ', merge: bool = False)

Returns a list of n-grams in descending order of their occurrences.

Parameters
  • s (pd.Series) – A pandas series.

  • n (int) – The length of the n-grams.

  • delimiter (str) – The delimiter which separates any two words.
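
A sketch of get_ngrams on a small series; the exact return format (for example, whether counts accompany the n-grams) may vary between versions:

    import pandas as pd
    from lingualytics.representation import get_ngrams

    s = pd.Series(["modi ji ki speech", "modi ji ka bhashan"])

    # Bigrams over the whole series, most frequent first.
    print(get_ngrams(s, n=2))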

Classification

class lingualytics.learner.CustomDataset(*args, **kwds)
class lingualytics.learner.Learner(data_dir='./dataset', output_dir='./output', dataset=None, lr=5e-05, num_train_epochs=5, train_bs=64, eval_bs=64, model_type='bert', model_name='bert-base-multilingual-cased', save_steps=1, seed=42, max_seq_length=256, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, device=None)

Finetune a pretrained model from Huggingface for text classification. The trained model and its predictions are saved to output_dir.

Parameters
  • data_dir (str) – Path of the dataset.

  • output_dir (str) – Path where the trained model and predictions will be saved.

  • dataset (str) – The dataset to use from the list of available datasets. Set to None to use your own dataset.

  • lr (float) – The learning rate for training.

  • num_train_epochs (int) – Number of epochs to train.

  • train_bs (int) – Batch size for training.

  • eval_bs (int) – Batch size while evaluating.

  • model_type (str) – The type of model to use from Huggingface.

  • model_name (str) – The name of the model to use from Huggingface.

  • save_steps (int) – Number of epochs to wait before saving the model again.

  • seed (int) – The seed to set at all places.

  • max_seq_length (int) – The maximum sequence length.

  • weight_decay (float) – Weight decay for training.

  • adam_epsilon (float) – Adam epsilon for training.

  • max_grad_norm (float) – Maximum gradient norm.

  • device (str) – Force the device ('cpu' or 'gpu') used for tensors.

acc_and_f1(preds, labels)
collate(examples)
convert_examples_to_features(examples, label_list, tokenizer)
download_dataset()
evaluate(mode, prefix='')
fit()

Download and finetune the model on the dataset.

get_labels()
load_and_cache_examples(tokenizer, labels, mode)
read_examples_from_file(mode='train')
set_seed()
setup_model()
simple_accuracy(preds, labels)
train(train_dataset, valid_dataset)
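
Putting the class together, a hedged end-to-end sketch; the dataset identifier below is a placeholder for one of the library's bundled datasets, and with dataset=None the examples are read from data_dir instead:

    from lingualytics.learner import Learner

    # Finetune multilingual BERT on a code-mixed classification dataset.
    learner = Learner(
        model_type="bert",
        model_name="bert-base-multilingual-cased",
        # 'sail-2017' is a placeholder identifier; substitute a dataset your
        # installed version actually bundles, or pass dataset=None and point
        # data_dir at your own train/validation/test files.
        dataset="sail-2017",
        lr=5e-5,
        num_train_epochs=5,
        train_bs=64,
        eval_bs=64,
    )

    # Download the dataset if needed and finetune; the trained model and its
    # predictions are written to output_dir ('./output' by default).
    learner.fit()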