germanetpy.icbased_similarity

Classes

Filterconfig(search_string[, ignore_case, ...])

This class is a configuration object that helps to filter GermaNet's lexical units and Synsets in order to extract those with certain properties of interest.

ICBasedSimilarity(germanet, wordcategory, path)

The IC-based measures are computed based on relative frequencies of words in a large corpus.

SemRelMeasure(*args, **kwargs)

This Enum represents the semantic relatedness measures.

class germanetpy.icbased_similarity.ICBasedSimilarity(germanet, wordcategory, path: str, separator: str = '\t')

Bases: object

The IC-based measures are computed based on relative frequencies of words in a large corpus. Synset frequencies are computed by adding up the frequencies of all words that belong to a Synset. These measures cannot be computed between synsets of different word categories.

create_simple_freq_dic(word_category, path: str, separator: str)

Reads in the frequency list files and stores the frequency information for each Synset in a dictionary. The keys are the Synset IDs. This method also adds all available Synset frequencies for the given category.

Parameters:
  • word_category (WordCategory) – The word category

  • path – The path to a frequency list containing words and their frequencies in a corpus

  • separator – The char that separates a word and its frequency in the given frequency list
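
To illustrate the file format described by these parameters, the following minimal sketch parses a frequency list with one word and its frequency per line. The helper name `read_frequency_list` and the sample data are hypothetical and not part of germanetpy; the actual method additionally maps the words to Synset IDs, which is not shown here.

```python
import io
from collections import defaultdict


def read_frequency_list(lines, separator="\t"):
    """Sum the corpus frequency of each word in a 'word<separator>frequency' list."""
    word2freq = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split at the last separator so the frequency is always the final field.
        word, _, freq = line.rpartition(separator)
        word2freq[word] += int(freq)
    return dict(word2freq)


# Hypothetical sample data; a real call would pass an open file instead.
sample = io.StringIO("Hund\t120\nKatze\t95\nHund\t5\n")
frequencies = read_frequency_list(sample)
```

Repeated entries for the same word are summed in this sketch; whether the library does the same is an assumption.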

init_min_max_normalization_values(synset_pair) → dict

This method computes the minimum values (two Synsets are equal) and the maximum values (two Synsets are maximally apart in the graph) used for normalization.

Parameters:

synset_pair (tuple(Synset, Synset)) – The Tuple of synsets that have the maximum distance in the graph

Returns:

a dictionary containing the (minimum value, maximum value) for each semantic similarity measure.

init_ic_map()

Computes the information content for each synset in GermaNet (of a given word category).

Return type:

dict, Synset

Returns:

A dictionary that maps each Synset to its IC, and the Synset with the highest IC.

get_information_content(synset) → float

The information content ranks semantic concepts from general to specific. The more specific a concept, the lower its probability and thus the higher its informativeness. The information content of a semantic concept is estimated from the relative frequency of the concept in a large corpus (cumulated synset frequency).

Parameters:

synset (Synset) – The synset for which the information content should be computed

Returns:

the information content for the given synset
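
The relationship between frequency and information content can be sketched as follows, assuming the standard definition from the literature, IC(c) = -log p(c), where p(c) is the cumulated frequency of a concept divided by the cumulated frequency of the root. The toy hierarchy, the function names, and the choice of the natural logarithm are assumptions made for illustration; the library's exact computation (for example any smoothing) may differ.

```python
import math

# Hypothetical toy hierarchy: synset -> direct hypernym (the root has none).
PARENT = {"Tier": None, "Hund": "Tier", "Katze": "Tier", "Dackel": "Hund"}
# Hypothetical simple (per-synset) corpus frequencies.
SIMPLE_FREQ = {"Tier": 10, "Hund": 50, "Katze": 30, "Dackel": 10}


def cumulative_frequencies(parent, simple_freq):
    """Add each synset's own frequency to itself and to all of its ancestors."""
    cum = {node: 0 for node in parent}
    for node, freq in simple_freq.items():
        current = node
        while current is not None:
            cum[current] += freq
            current = parent[current]
    return cum


def information_content(cum_freq, root):
    """IC(c) = -log(cum_freq(c) / cum_freq(root))."""
    root_freq = cum_freq[root]
    return {node: -math.log(freq / root_freq) for node, freq in cum_freq.items()}


cum_freq = cumulative_frequencies(PARENT, SIMPLE_FREQ)
ic = information_content(cum_freq, root="Tier")
most_informative = max(ic, key=ic.get)  # the most specific (rarest) concept
```

In this toy data the root receives an IC of zero because its cumulated frequency equals the total, while the rarest leaf receives the highest IC.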

resnik(synset1, synset2, normalize: bool = False, normalized_max: float = 1.0) → float

Two concepts are more related the more information they share. The shared information of two concepts can be quantified by the information content (IC) of their lowest common subsumer (LCS). When several LCSs are available, the highest IC is returned.

Parameters:
  • synset1 (Synset) – The source synset

  • synset2 (Synset) – The target synset

  • normalize – The relatedness value can be normalized to a number between the possible minimum of that measure and a given upper bound.

  • normalized_max – The upper bound of the range the measure is normalized to.

Returns:

The information content of the LCS of the two given synsets.
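
A minimal sketch of this idea, assuming a small hypernym graph and precomputed IC values (both hypothetical). It takes the highest IC among all shared subsumers, which is the usual formulation of Resnik's measure; the function and variable names are illustrative and not the library's internals.

```python
# Hypothetical toy data: synset -> list of direct hypernyms, plus IC values.
HYPERNYMS = {
    "root": [],
    "Tier": ["root"],
    "Haustier": ["root"],
    "Hund": ["Tier", "Haustier"],
    "Katze": ["Tier", "Haustier"],
}
IC = {"root": 0.0, "Tier": 1.2, "Haustier": 1.6, "Hund": 2.5, "Katze": 2.8}


def subsumers(synset, hypernyms):
    """Return the synset itself plus all of its transitive hypernyms."""
    seen = set()
    stack = [synset]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(hypernyms[node])
    return seen


def resnik_sketch(synset1, synset2, hypernyms, ic):
    """Highest IC among the subsumers shared by both synsets."""
    common = subsumers(synset1, hypernyms) & subsumers(synset2, hypernyms)
    return max(ic[node] for node in common)


score = resnik_sketch("Hund", "Katze", HYPERNYMS, IC)
```

Here "Hund" and "Katze" share two equally close subsumers, and the more informative one ("Haustier") determines the score.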

jiang_and_conrath(synset1, synset2, normalize: float = False, normalized_max: float = 1.0) → float

The Jiang and Conrath measure includes knowledge about the individual information contents of each synset. The smaller the difference between the information contents of the two synsets, the more related they are.

Parameters:
  • synset1 (Synset) – The source synset

  • synset2 (Synset) – The target synset

  • normalize – The relatedness value can be normalized to a number between the possible minimum of that measure and a given upper bound.

  • normalized_max – The upper bound of the range the measure is normalized to.

Returns:

The Jiang and Conrath relatedness measure
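
For orientation, the Jiang and Conrath distance is commonly defined in the literature as IC(s1) + IC(s2) - 2 * IC(LCS). The sketch below implements only that textbook distance with hypothetical IC values; how this method turns the distance into a relatedness value is not stated here, so that step is deliberately left out.

```python
def jiang_conrath_distance(ic1, ic2, ic_lcs):
    """Textbook Jiang and Conrath distance: IC(s1) + IC(s2) - 2 * IC(LCS)."""
    return ic1 + ic2 - 2.0 * ic_lcs


# Hypothetical IC values for two synsets and their most informative LCS.
distance = jiang_conrath_distance(ic1=2.5, ic2=2.8, ic_lcs=1.6)
# A synset compared with itself is its own LCS, so the distance is zero.
identical = jiang_conrath_distance(ic1=2.5, ic2=2.5, ic_lcs=2.5)
```

A distance of zero corresponds to maximal relatedness, and larger distances to weaker relatedness.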

lin(synset1, synset2, normalize: bool = False, normalized_max: float = 1.0) → float

The Lin measure takes the individual information contents of each synset and the information content of the LCS into account. The LCS with the highest information content is used for the computation.

Parameters:
  • synset1 (Synset) – The source synset

  • synset2 (Synset) – The target synset

  • normalize – The relatedness value can be normalized to a number between the possible minimum of that measure and a given upper bound.

  • normalized_max – The upper bound of the range the measure is normalized to.

Returns:

The Lin relatedness measure
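
In the literature, the Lin measure is usually written as 2 * IC(LCS) / (IC(s1) + IC(s2)). The following sketch implements that textbook formula with hypothetical IC values; the handling of a zero denominator is an assumption and may not match the library's behavior.

```python
def lin_sketch(ic1, ic2, ic_lcs):
    """Textbook Lin measure: 2 * IC(LCS) / (IC(s1) + IC(s2))."""
    denominator = ic1 + ic2
    if denominator == 0:
        # Assumption: neither synset carries information (e.g. both are the root).
        return 0.0
    return 2.0 * ic_lcs / denominator


# Hypothetical IC values for two synsets and their most informative LCS.
similarity = lin_sketch(ic1=2.5, ic2=2.8, ic_lcs=1.6)
# A synset compared with itself is its own LCS, which yields the maximum of 1.0.
identical = lin_sketch(ic1=2.5, ic2=2.5, ic_lcs=2.5)
```

Because the IC of the LCS cannot exceed the IC of either synset, the unnormalized value stays between 0 and 1.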

normalize(raw_value: float, normalized_max: float, semrel_measure: SemRelMeasure) → float

Normalizes a raw value of semantic relatedness to a value between a lower bound and the given upper bound.

Parameters:
  • raw_value – The raw value

  • normalized_max – The upper bound

  • semrel_measure – The semantic relatedness measure that the value corresponds to.

Returns:

The normalized semantic relatedness value
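
One plausible reading of this normalization is a linear min-max rescaling. The sketch below assumes a lower bound of 0 and a per-measure (minimum, maximum) pair such as the one returned by init_min_max_normalization_values; the function name, the dictionary contents, and the lower bound are assumptions and may not match the library's behavior.

```python
def min_max_normalize(raw_value, min_value, max_value, normalized_max=1.0):
    """Linearly rescale raw_value from [min_value, max_value] to [0, normalized_max]."""
    if max_value == min_value:
        raise ValueError("min_value and max_value must differ")
    return (raw_value - min_value) / (max_value - min_value) * normalized_max


# Hypothetical (minimum, maximum) raw values for one measure.
normalization = {"resnik": (0.0, 4.0)}
low, high = normalization["resnik"]
normalized = min_max_normalize(1.0, low, high, normalized_max=10.0)
```

With these hypothetical bounds, a raw value one quarter of the way between minimum and maximum lands one quarter of the way up the target range.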

property germanet
property root_freq
property synset2cumfreq
property jcnmaxdist
property normalization_dic
property synset2ic
property most_informative_synset
property synset2simple_freq