Welcome to Lingualytics’s documentation!

Stopwords

Preprocessing

lingualytics.preprocessing.remove_lessthan(s: pandas.core.series.Series, length: int) → pandas.core.series.Series

Removes words shorter than a specified minimum length.

Parameters
  • s (pd.Series) – A pandas series.

  • length (int) – The minimum length a word should have.
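
A minimal usage sketch (the example sentence and the length threshold are illustrative):

    import pandas as pd
    from lingualytics.preprocessing import remove_lessthan

    # A toy series of code-mixed text; real data would come from your own corpus.
    s = pd.Series(["yeh movie is so accha yaar"])

    # Keep only words with at least 4 characters; shorter words like "is" and "so" are dropped.
    print(remove_lessthan(s, length=4))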

Removes links from the text.

Parameters
  • s (pd.Series) – A pandas series.
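
The entry above does not show the function's own name; assuming it is exposed as lingualytics.preprocessing.remove_links in your installed version, usage would look like this sketch:

    import pandas as pd
    from lingualytics.preprocessing import remove_links  # assumed name; verify against your version

    s = pd.Series(["check this out https://example.com bahut sahi hai"])

    # Strip URLs from every entry in the series.
    print(remove_links(s))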

lingualytics.preprocessing.remove_punctuation(s: pandas.core.series.Series, punctuation: str = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') → pandas.core.series.Series

Removes punctuation from the text.

Parameters
  • s (pd.Series) – A pandas series.

  • punctuation (str) – All the punctuation characters you want to remove.
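
A short sketch of remove_punctuation; the default set matches Python's string.punctuation, and a custom subset can be passed instead:

    import pandas as pd
    from lingualytics.preprocessing import remove_punctuation

    s = pd.Series(["kya baat hai!! #awesome, right?"])

    # Remove the full default punctuation set.
    print(remove_punctuation(s))

    # Remove only a chosen subset of characters.
    print(remove_punctuation(s, punctuation="!?#"))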

lingualytics.preprocessing.remove_stopwords(s: pandas.core.series.Series, stopwords: list)

Removes stopwords from the text.

Parameters
  • s (pd.Series) – A pandas series.

  • stopwords (list of str) – A list of stopwords you want to remove.
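
Because each helper takes a pandas Series and returns one, they compose naturally with Series.pipe. A sketch of a small cleaning pipeline, assuming the English and Hindi stopword lists are shipped as lingualytics.stopwords.en_stopwords and hi_stopwords:

    import pandas as pd
    from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
    from lingualytics.stopwords import en_stopwords, hi_stopwords  # assumed module contents

    df = pd.DataFrame({"text": ["yeh movie bahut achhi thi!!", "the plot was thoda slow :("]})

    # Chain the cleaners with pandas.Series.pipe.
    df["clean_text"] = (
        df["text"]
        .pipe(remove_punctuation)
        .pipe(remove_lessthan, length=3)
        .pipe(remove_stopwords, stopwords=list(en_stopwords) + list(hi_stopwords))
    )
    print(df["clean_text"])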

Representation

lingualytics.representation.get_ngrams(s: pandas.core.series.Series, n: int, delimiter: str = ' ', merge: bool = False)

Returns a list of n-grams in descending order of their occurrences.

Parameters
  • s (pd.Series) – A pandas series.

  • n (int) – The length of the n-grams.

  • delimiter (str) – The delimiter which separates any two words.
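
A sketch of get_ngrams on a small series; the exact return format (for example, whether counts accompany the n-grams) may vary between versions:

    import pandas as pd
    from lingualytics.representation import get_ngrams

    s = pd.Series(["modi ji ki speech", "modi ji ka bhashan"])

    # Bigrams over the whole series, most frequent first.
    print(get_ngrams(s, n=2))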

Classification

class lingualytics.learner.CustomDataset(*args, **kwds)
class lingualytics.learner.Learner(data_dir='./dataset', output_dir='./output', dataset=None, lr=5e-05, num_train_epochs=5, train_bs=64, eval_bs=64, model_type='bert', model_name='bert-base-multilingual-cased', save_steps=1, seed=42, max_seq_length=256, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, device=None)

Finetune a pretrained model from Huggingface for text classification. The trained model and its predictions are saved to output_dir.

Parameters
  • data_dir (str) – Path of the dataset.

  • output_dir (str) – Path where the trained model and predictions will be saved.

  • dataset (str) – The dataset to use from the list of available datasets. Set to None to use your own dataset.

  • lr (float) – The learning rate for training.

  • num_train_epochs (int) – Number of epochs to train.

  • train_bs (int) – Batch size for training.

  • eval_bs (int) – Batch size while evaluating.

  • model_type (str) – The type of model to use from Huggingface.

  • model_name (str) – The name of the model to use from Huggingface.

  • save_steps (int) – Number of epochs to wait before saving the model again.

  • seed (int) – The seed to set at all places.

  • max_seq_length (int) – The maximum sequence length.

  • weight_decay (float) – Weight decay for training.

  • adam_epsilon (float) – Adam epsilon for training.

  • max_grad_norm (float) – Maximum gradient norm.

  • device (str) – Force the device ('cpu' or 'gpu') used for tensors.

acc_and_f1(preds, labels)
collate(examples)
convert_examples_to_features(examples, label_list, tokenizer)
download_dataset()
evaluate(mode, prefix='')
fit()

Download and finetune the model on the dataset.

get_labels()
load_and_cache_examples(tokenizer, labels, mode)
read_examples_from_file(mode='train')
set_seed()
setup_model()
simple_accuracy(preds, labels)
train(train_dataset, valid_dataset)
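
Putting the class together, a hedged end-to-end sketch; the dataset identifier below is a placeholder for one of the library's bundled datasets, and with dataset=None the examples are read from data_dir instead:

    from lingualytics.learner import Learner

    # Finetune multilingual BERT on a code-mixed classification dataset.
    learner = Learner(
        model_type="bert",
        model_name="bert-base-multilingual-cased",
        # 'sail-2017' is a placeholder identifier; substitute a dataset your
        # installed version actually bundles, or pass dataset=None and point
        # data_dir at your own train/validation/test files.
        dataset="sail-2017",
        lr=5e-5,
        num_train_epochs=5,
        train_bs=64,
        eval_bs=64,
    )

    # Download the dataset if needed and finetune; the trained model and its
    # predictions are written to output_dir ('./output' by default).
    learner.fit()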