liwca.ddr

liwca.ddr(texts, dx, embeddings, *, tokenizer=None, precision=None)

Score documents against dictionary categories via Distributed Dictionary Representations (DDR).

Parameters:
  • texts (Iterable of str or Series) – The documents to score. Each element is a single document string.

  • dx (DataFrame) – A LIWC dictionary DataFrame as returned by liwca.read_dx. Index contains dictionary terms (may include * wildcards), columns are category names, values are binary (0/1).

  • embeddings (str or Mapping) – Word embeddings to use. Pass a string to load a pre-trained model via gensim.downloader.load (requires pip install liwca[ddr]), e.g. "glove-wiki-gigaword-100". Alternatively, pass any mapping that supports embeddings[word] and word in embeddings, such as a plain dict or gensim KeyedVectors.

  • tokenizer (Callable, optional) – A function str -> list[str] used to split each document into lowercase tokens. Defaults to a regex tokenizer that preserves contractions (identical to liwca.count’s default).

  • precision (int, optional) – If set, round cosine similarity values to this many decimal places.
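The default tokenizer is described above as a regex tokenizer that lowercases and preserves contractions. A minimal sketch of that behavior (the exact pattern used by liwca is an assumption here, not the library's actual regex):

```python
import re

def tokenize(text):
    """Lowercase and split on word characters, keeping contractions
    like "don't" together (illustrative pattern, not liwca's own)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

tokens = tokenize("Don't panic, it's fine")
# -> ["don't", "panic", "it's", "fine"]
```

Any callable with the same str -> list[str] shape can be passed as tokenizer.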

Returns:

A documents × categories DataFrame. Index matches the input order (or the Series index if a Series was passed). Columns are the sorted dictionary category names. Values are cosine similarities in [-1, 1], or NaN when the document or category vector is undefined (see module docstring for OOV details).

Return type:

DataFrame
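Each returned score is the cosine similarity between an averaged document vector and an averaged category vector, with out-of-vocabulary words dropped from the averages. A toy illustration of that computation (hand-made 3-d vectors, not real GloVe values, and not liwca's internal code):

```python
import numpy as np

# Hypothetical toy embeddings; real models have hundreds of dimensions.
emb = {
    "danger": np.array([1.0, 0.0, 0.0]),
    "threat": np.array([0.9, 0.1, 0.0]),
    "safe":   np.array([0.0, 1.0, 0.0]),
}

def mean_vector(words, emb):
    """Average the embeddings of in-vocabulary words; None if all are OOV."""
    vecs = [emb[w] for w in words if w in emb]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_vec = mean_vector(["danger", "lurks", "ahead"], emb)  # "lurks", "ahead" are OOV
cat_vec = mean_vector(["danger", "threat"], emb)          # dictionary terms for one category
score = cosine(doc_vec, cat_vec)  # high similarity, just under 1.0
```

When every token of a document (or every term of a category) is out of vocabulary, no vector can be formed, which is where the NaN values described above come from.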

Examples

>>> import liwca
>>> dx = liwca.fetch_threat()
>>> results = liwca.ddr(
...     ["danger lurks ahead"],
...     dx,
...     "glove-wiki-gigaword-100",
... )