liwca.count

liwca.count(texts: Iterable[str] | Series, dx: DataFrame, *, tokenizer: Callable[[str], list[str]] | None = None, as_percentage: bool = True, precision: int | None = None, return_words: Literal[False] = False) → DataFrame
liwca.count(texts: Iterable[str] | Series, dx: DataFrame, *, tokenizer: Callable[[str], list[str]] | None = None, as_percentage: bool = True, precision: int | None = None, return_words: Literal[True]) → tuple[DataFrame, DataFrame]

Count LIWC dictionary categories across documents (pure Python).

Parameters:
  • texts (Iterable of str or Series) – The documents to analyse. Each element is a single document string.

  • dx (DataFrame) – A LIWC dictionary DataFrame as returned by liwca.read_dx. Index contains dictionary terms (may include * wildcards), columns are category names, values are binary (0/1).

  • tokenizer (Callable, optional) – A function str -> list[str] used to split each document into lowercase tokens. Defaults to a regex tokenizer that preserves contractions ("don't" → ["don't"]).

  • as_percentage (bool, optional) – If True (default), return category values as a percentage of total word count per document (matching LIWC’s default output). If False, return raw category counts.

  • precision (int, optional) – If set, round category value columns to this many decimal places. Only applies when as_percentage=True. The "WC" column is never rounded.

  • return_words (bool, optional) – If True, return a tuple (categories, words) where words is a documents × tokens DataFrame holding per-word counts (or percentages) for every dictionary token that appeared in the corpus. Wildcard entries are expanded to the actual corpus tokens that matched (e.g., recall* → recalled, recalling, …). The same as_percentage and precision settings apply to both DataFrames. Default False.
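A custom tokenizer only needs to be a plain function mapping a string to a list of lowercase tokens. A minimal sketch of a contraction-preserving tokenizer (the default pattern is internal to liwca, so the regex below is illustrative, not the actual default):

```python
import re

# Runs of letters, optionally joined by one apostrophe, so that
# contractions like "don't" survive as single tokens. Illustrative
# only; liwca's built-in default tokenizer may differ in detail.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def simple_tokenizer(text: str) -> list[str]:
    return TOKEN_RE.findall(text.lower())

simple_tokenizer("Don't panic, it's fine.")
# ["don't", 'panic', "it's", 'fine']
```

Pass it via the keyword argument: liwca.count(texts, dx, tokenizer=simple_tokenizer).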

Returns:

When return_words=False (default): a documents × categories DataFrame. Index matches the input order (or the Series index if a Series was passed). Columns are the dictionary category names. An additional "WC" column contains the total word count for each document.

When return_words=True: a tuple (categories, words) where categories is the DataFrame described above and words is a documents × tokens DataFrame with one column per matched dictionary token plus a "WC" column.

Return type:

DataFrame or tuple of DataFrame
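The counting semantics can be sketched in plain Python: tokenize, match each token against the dictionary terms (a trailing * acts as a wildcard), sum category hits, and normalise by word count. This is an illustrative re-implementation of the behaviour described above, with toy terms and category names, not liwca's actual code:

```python
import fnmatch

def count_doc(tokens: list[str],
              dx_terms: dict[str, set[str]]) -> tuple[int, dict[str, float]]:
    """dx_terms maps a dictionary term (possibly ending in *) to its categories.

    Returns (word count, {category: percentage of words}).
    """
    wc = len(tokens)
    counts: dict[str, int] = {}
    for tok in tokens:
        for term, cats in dx_terms.items():
            # A trailing * makes the term a prefix wildcard.
            hit = fnmatch.fnmatch(tok, term) if term.endswith("*") else tok == term
            if hit:
                for cat in cats:
                    counts[cat] = counts.get(cat, 0) + 1
    # as_percentage=True semantics: express counts per 100 words.
    return wc, {cat: 100 * n / wc for cat, n in counts.items()}

count_doc(["she", "recalled", "the", "threat"],
          {"recall*": {"cogproc"}, "threat": {"threat"}})
# (4, {'cogproc': 25.0, 'threat': 25.0})
```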

Examples

>>> import liwca
>>> dx = liwca.fetch_threat()
>>> texts = [
...     "This is a grave threat to our safety.",
...     "All is calm today.",
... ]
>>> liwca.count(texts, dx)
Category  WC  threat
0          8    12.5
1          4     0.0

Get per-word contributions:

>>> cats, words = liwca.count(texts, dx, return_words=True)
>>> words.columns.tolist()
['WC', 'grave', 'threat']
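When as_percentage=True and the dictionary weights are binary 0/1 (as described above), each category value is 100 * matches / WC, so raw match counts can be recovered from the percentage output. A quick sketch of that inversion, using hypothetical row values:

```python
# A percentage-mode output row, with hypothetical values.
row = {"WC": 8, "threat": 12.5}

# Invert value = 100 * matches / WC to get the raw match count,
# assuming binary 0/1 category weights.
counts = {cat: round(val * row["WC"] / 100)
          for cat, val in row.items() if cat != "WC"}
# counts == {"threat": 1}
```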