liwca.ddr

liwca.ddr(texts, dx, embeddings, *, tokenizer=None, precision=None)

Score documents against dictionary categories via Distributed Dictionary Representations (DDR).

Parameters:
  • texts (Iterable of str or Series) – The documents to score. Each element is a single document string.

  • dx (DataFrame) – A LIWC dictionary DataFrame as returned by liwca.read_dx. Index contains dictionary terms (may include * wildcards), columns are category names, values are binary (0/1).

  • embeddings (str or Mapping) – Word embeddings to use. Pass a string to load a pre-trained model via gensim.downloader.load (requires pip install liwca[ddr]), e.g. "glove-wiki-gigaword-100". Alternatively, pass any mapping that supports embeddings[word] and word in embeddings, such as a plain dict or gensim KeyedVectors.

  • tokenizer (Callable, optional) – A function str -> list[str] used to split each document into lowercase tokens. Defaults to a regex tokenizer that preserves contractions (identical to liwca.count’s default).

  • precision (int, optional) – If set, round cosine similarity values to this many decimal places.
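The default tokenizer is described above as a regex tokenizer that lowercases and preserves contractions. A minimal sketch of that behavior (the exact pattern used by liwca is an assumption here, not the library's actual regex):

```python
import re

def tokenize(text):
    """Lowercase and split on word characters, keeping contractions
    like "don't" together (illustrative pattern, not liwca's own)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

tokens = tokenize("Don't panic, it's fine")
# -> ["don't", "panic", "it's", "fine"]
```

Any callable with the same str -> list[str] shape can be passed as tokenizer.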

Returns:

A documents × categories DataFrame. Index matches the input order (or the Series index if a Series was passed). Columns are the sorted dictionary category names. Values are cosine similarities in [-1, 1], or NaN when the document or category vector is undefined (see module docstring for OOV details).

Return type:

DataFrame
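Each returned score is the cosine similarity between an averaged document vector and an averaged category vector, with out-of-vocabulary words dropped from the averages. A toy illustration of that computation (hand-made 3-d vectors, not real GloVe values, and not liwca's internal code):

```python
import numpy as np

# Hypothetical toy embeddings; real models have hundreds of dimensions.
emb = {
    "danger": np.array([1.0, 0.0, 0.0]),
    "threat": np.array([0.9, 0.1, 0.0]),
    "safe":   np.array([0.0, 1.0, 0.0]),
}

def mean_vector(words, emb):
    """Average the embeddings of in-vocabulary words; None if all are OOV."""
    vecs = [emb[w] for w in words if w in emb]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_vec = mean_vector(["danger", "lurks", "ahead"], emb)  # "lurks", "ahead" are OOV
cat_vec = mean_vector(["danger", "threat"], emb)          # dictionary terms for one category
score = cosine(doc_vec, cat_vec)  # high similarity, just under 1.0
```

When every token of a document (or every term of a category) is out of vocabulary, no vector can be formed, which is where the NaN values described above come from.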

Examples

>>> import liwca
>>> dx = liwca.fetch_threat()
>>> results = liwca.ddr(
...     ["danger lurks ahead"],
...     dx,
...     "glove-wiki-gigaword-100",
... )