Welcome to Lingualytics’s documentation!
Stopwords
Preprocessing
lingualytics.preprocessing.remove_lessthan(s: pandas.core.series.Series, length: int) → pandas.core.series.Series
Removes words shorter than a given minimum length.
Parameters
s (pd.Series) – A pandas series.
length (int) – The minimum length a word should have.
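Example (a minimal sketch; the sample sentences are made up for illustration):

    import pandas as pd
    from lingualytics.preprocessing import remove_lessthan

    s = pd.Series(["ok na i am at the gym", "this is so much fun"])
    # Keep only words that are at least 3 characters long.
    print(remove_lessthan(s, length=3))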
lingualytics.preprocessing.remove_links(s: pandas.core.series.Series) → pandas.core.series.Series
Removes links from the text.
Parameters
s (pd.Series) – A pandas series.
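Example (a minimal sketch; the URL below is a placeholder):

    import pandas as pd
    from lingualytics.preprocessing import remove_links

    s = pd.Series(["check this out https://example.com", "no link in this one"])
    # Strip URLs from every row of the series.
    print(remove_links(s))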
lingualytics.preprocessing.remove_punctuation(s: pandas.core.series.Series, punctuation: str = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') → pandas.core.series.Series
Removes punctuation from the text.
Parameters
s (pd.Series) – A pandas series.
punctuation (str) – All the punctuation characters you want to remove.
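Example (a minimal sketch using the default punctuation set; pass your own string to remove only specific characters):

    import pandas as pd
    from lingualytics.preprocessing import remove_punctuation

    s = pd.Series(["hello, world!!", "kya haal hai?"])
    # Remove the default punctuation characters from each row.
    print(remove_punctuation(s))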
lingualytics.preprocessing.remove_stopwords(s: pandas.core.series.Series, stopwords: list)
Removes stopwords from the text.
Parameters
s (pd.Series) – A pandas series.
stopwords (list of str) – A list of stopwords you want to remove.
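Example (a minimal sketch chaining the preprocessing helpers above with pandas pipe; the hand-written stopword list is only for illustration, and any list of strings can be passed instead):

    import pandas as pd
    from lingualytics.preprocessing import (remove_links, remove_punctuation,
                                            remove_lessthan, remove_stopwords)

    df = pd.DataFrame({"text": ["Loved it!! see https://example.com",
                                "the movie was not bad at all"]})
    # Chain the cleaning steps; each helper takes and returns a pandas series.
    df["clean_text"] = (df["text"]
                        .pipe(remove_links)
                        .pipe(remove_punctuation)
                        .pipe(remove_lessthan, length=3)
                        .pipe(remove_stopwords, stopwords=["the", "was", "not", "see"]))
    print(df["clean_text"])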
Representation
lingualytics.representation.get_ngrams(s: pandas.core.series.Series, n: int, delimiter: str = ' ', merge: bool = False)
Returns a list of n-grams in descending order of their occurrences.
Parameters
s (pd.Series) – A pandas series.
n (int) – The number of words in each n-gram.
delimiter (str) – The delimiter which separates any two words.
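Example (a minimal sketch; the sample text is made up):

    import pandas as pd
    from lingualytics.representation import get_ngrams

    s = pd.Series(["new york is big", "i love new york"])
    # Bigrams ranked by how often they occur across the series.
    bigrams = get_ngrams(s, n=2)
    print(bigrams[:5])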
Classification
class lingualytics.learner.CustomDataset(*args, **kwds)
class lingualytics.learner.Learner(data_dir='./dataset', output_dir='./output', dataset=None, lr=5e-05, num_train_epochs=5, train_bs=64, eval_bs=64, model_type='bert', model_name='bert-base-multilingual-cased', save_steps=1, seed=42, max_seq_length=256, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, device=None)
Downloads and fine-tunes a transformer model from Huggingface on a text classification dataset.
Parameters
data_dir (str) – Path of the dataset.
output_dir (str) – Path where the trained model and predictions will be saved.
dataset (str) – The dataset to use from the list of available datasets. Set to None to use your own dataset.
lr (float) – The learning rate for training.
num_train_epochs (int) – Number of epochs to train.
train_bs (int) – Batch size for training.
eval_bs (int) – Batch size while evaluating.
model_type (str) – The type of model to use from Huggingface.
model_name (str) – The name of the model to use from Huggingface.
save_steps (int) – Number of epochs to wait before saving the model again.
seed (int) – The seed to set at all places.
max_seq_length (int) – The maximum sequence length.
weight_decay (float) – Weight decay for training.
adam_epsilon (float) – Adam epsilon for training.
max_grad_norm (float) – Maximum gradient norm.
device (str) – Force the device to ‘cpu’ or ‘gpu’ for tensors.
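Example (a minimal fine-tuning sketch; the dataset name 'sail' is an assumption about the bundled datasets, so substitute one of the library's available datasets or set dataset=None and point data_dir at your own files):

    from lingualytics.learner import Learner

    # 'sail' is illustrative; set dataset=None and use data_dir for your own data.
    learner = Learner(model_type='bert',
                      model_name='bert-base-multilingual-cased',
                      dataset='sail',
                      num_train_epochs=3,
                      train_bs=32,
                      eval_bs=32)
    # Downloads the dataset and model, fine-tunes, and saves results to output_dir.
    learner.fit()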
acc_and_f1(preds, labels)
collate(examples)
convert_examples_to_features(examples, label_list, tokenizer)
download_dataset()
evaluate(mode, prefix='')
fit()
Download and finetune the model on the dataset.
get_labels()
load_and_cache_examples(tokenizer, labels, mode)
read_examples_from_file(mode='train')
set_seed()
setup_model()
simple_accuracy(preds, labels)
train(train_dataset, valid_dataset)