Whats New in Version 2.5?¶
Version 2.5 of LingPy introduces a couple of new algorithms, but more importantly, much more consistency than earlier versions. As of now, LingPy runs in Python 2 and Python 3 across all major platforms, and we try hard to make sure it stays in this way.
Below, we list some interesting new features shipped along with LingPy 2.5 and which you might want to test.
Enhanced Documentation¶
You may not notice this directly, but we have enhanced the documentation. New algorithms which have
been added are all fully documented, but in addition, we have finally updated the documentation for
the deep alignment modules, such as calign and
misc. If you want to use these functions to speed up your
alignments, or to create your own algorithms that build on LingPy’s deep alignment functions, you
should definitely have a look at the additional documentation and some examples which we added.
The Command Line Interface¶
Version 2.5 introduces a command line interface which remains purely experimental for the moment, but will be further enhanced in the future in order to enable users to include LingPy in their pipelines with other programs.
As an example, after installing LingPy on your machine, just open a terminal, and try the following:
$ lingpy -h
This will show you all basic instructions regarding the usage of the command line interface. The basic idea here is that we split the commandline into different subtasks:
$ lingpy pairwise -h
usage: lingpy pairwise [-h] [--input-file INPUT_FILE]
[--output-file OUTPUT_FILE] [--factor FACTOR]
[--gop GOP] [--scale SCALE]
[--restricted-chars RESTRICTED_CHARS]
[--strings STRINGS STRINGS]
[--mode {global,local,overlap,dialign}] [--distance]
[--method {sca,basic}]
...
If you want to align a couple of strings, you can, for example, do the following:
$ lingpy multiple -s woldemort walter waldemar
w o l d e m o r t
w a l t e - - r -
w a l d e m a r -
There are many more possibilities, but since this features is still experimental, we did not work out a full documentation, so for the moment, we recommend to follow the instructions provided by adding the “-h” command to each of the subcommands currently available, when testing this feature.
Slowly Saying Goodbye to the Alias System¶
The alias-system in LingPy wordlists has lead to some confusion among users. Basically, the idea was to guarantee a maximum of flexibility. As a result, the flexibility lead to a high degree of confusion. We should note, however, that most of these aspects are documented, especially in our tutorial section, and it is recommended for all users and those interested in using LingPy to give these a proper read.
As of version 2.5, however, we re-arranged the handling to potentially overcome the problem of
namespaces by adding explicit arguments to the main classes for alignments and cognate sets, like
Alignments and LexStat. As of now, you can specify
integral parts of the data by passing how you define your namespace as an argument upon
initialization. For a LexStat analysis, for example, we need to know where we find the following
information and how it should be named:
- transcription (usually called “ipa”)
- segments (usually called “tokens”, if not given, it is created from “transcription”)
- classes (the sound classes, which are usually created from “segments”)
- langid (language identifiers, which are needed to create the individual segments for each language, usually created)
- prostrings (the prosodic strings, distinguishing context, usually created)
- numbers (the combination of langid, classes, and prostrings to form unique segment representations for each segment in each language, also the basic value passed to the scoring function)
- duplicates (the column in which duplicates are stored, and marked by a 1 for a duplicated entry, that is, an entry which is identical with another entry with a different meaning, usually created)
For an Alignments analysis, we need to know:
- transcription,
- segments, and
- alignment (for alignment analyses, the argument where alignments should be stored, created automatically or manually)
From now on, you can define your own namespaces by passing those as arguments when loading a wordlist for a LexStat or an alignment analysis:
>>> lex = LexStat('myfile.tsv', segments="segments", transcription="transcription")
>>> alm = Alignments('myalms.tsv', segments="segments", alignment="myperfectalignment")
This allows for a much greated flexibility and is a first step towards replacing the alias and the rc-configuration file system in which the namespace was handled previously.
Partial Cognate Detection¶
We present an initial attempt to carry out partial cognate detection and which is shipped along with
LingPy 2.5 and accessible via the Partial class which
itself expands the LexStat class for automatic cognate
judgments. The main requirement for partial cognate detection is that your data is morphologically
annotated. Morphological annotation will be assumed automatically, if your dataset contains tone
annotations in South-East-Asian style, with superscript or subscript numbers, since in most of these
languages each syllable corresponds to one morpheme. If this is not given, you need to provide
morphological annotations by adding one of the currently accepted symbols for morpheme segmentation.
If you want to know which symbols are currently accepted, you can check with the
rc function:
>>> from lingpy import rc
>>> rc('morpheme_separators')
'◦+'
>>> rc('word_separators')
'_#'
As you can see, we currently support two morpheme separators and two word separators. Note, however, that there is no difference in treatment when using the new method for partial cognate detection: whether your data uses the morpheme separators or the word separators should not make a difference, since LingPy will split all words in your data into their smallest units.
In order to test partial cognate detection, you can use a very small test set shipped with the LingPy test suite:
>>> from lingpy.compare.partial import Partial
>>> from lingpy.tests.util import test_data
>>> part = Partial(test_data('partial_cognates.tsv'), segments="segments")
>>> part.partial_cluster(method='sca', cluster_method='upgma', ref='partial_ids')
>>> for k in [1,2,3,4,5]:
print(k, ''.join(part[k, 'segments']), ' '.join([str(x) for x in part[k, 'partial_ids']]))
...
1 xu²⁴+ni⁵⁵ 3 4
2 xu²⁴ni⁴⁴ 3 4
3 xu³⁵+ni⁵⁵ 3 4
4 bu¹³li⁵³ 1 2
5 bu¹³ 1
Note that now, if you provide both partial cognate sets and segmented words, you can even align all partial cognate sets in separation. The only current restriction here is that you need to follow our namespace for fuzzy cognates as defined in the wordlist.rc (but you can modify it accordingly by just creating your own wordlist.rc and passing it as an extra argument):
>>> alm = Alignments(test_data('partial_cognates.tsv'),ref='partial_ids', segments='segments')
>>> alm.align()
>>> print(SCA(alm.msa['partial_ids'][1]))
x u ³⁵
x u ²⁴
x u ²⁴
x u ²⁴
The basic idea behind this change is a different handling of etymological dictionaries within the
Wordlist class: Earlier, an etymological dictionary was just a
transposed representation in which the cognate identifier was the key of a dictionary, and a list of values in order
of the languages linked to the words, with a zero indicating absence of values:
>>> part = Partial(test_data('partial_cognates.tsv'), segments="segments")
>>> etd1 = part.get_etymdict('partialids2')
>>> etd2 = part.get_etymdict('pcogsets')
>>> len(etd1), len(etd2)
13, 19
>>> etd1.keys()
dict_keys(['7 8 9', '18 16 17', '3 4', '18 16', '13', '12', '1 1', '5 6', '1 2', '14 15', '12 10 11', '12 10', '1'])
>>> etd2.keys()
dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
In the case of etd1, we reference the column “partialcogids2” for the creation of the
etymological dictionary, but for this column, there is no instruction in LingPy, no column type
defined in the wordlist-rc file. As a result, its values are interpreted as strings, and the
etymological dictionary that is created has strings as keys. In the case of the second column, the
column “pcogsets” is a reserved namespace for partial cognate sets, and all values are converted to
lists of integers. If the Wordlist class (or its descendants) creates an etymdict and detects a list
type instead of a string type or an integer type, it automatically assumes that the data is “fuzzy”,
and it creates a different etymdict in which each item of the list of integer ids is transformed
into the key of the dictionary.
Note that even if this sounds a bit confusing in the moment, we will try to further improve the handling of this feature. This is closely related to the general change of wordlist handling in LingPy using the wordlist-rc files, which we will try to replace with a more consistent system.
Evaluation methods¶
If you want to test how well a partial cognate detection analysis works, there is a special function
for evaluation, and its usage is idential with the usage of the classical
bcubes function:
>>> from lingpy.evaluate.acd import partial_bcubes
>>> partial_bcubes(lex, "partialid", "pcogsets")
Furthermore, you can test n-point average precision (npoint_ap) on string similarity measures, which we
implemented following the suggestion by Kondrak2002. The n-point average precision is a
measure that can be used to test how well string similarities or distances distinguish cognates from
non-cognates, and it is therefore useful for the evaluation or the training of different string
similarity measures:
>>> from lingpy.evaluate.acd import npoint_ap
>>> from lingpy import *
>>> word_pairs = [("harry", "gari"), ("walter", "woldemort"), ("harry", "walter"), ("gari", "woldemort")]
>>> dists = [edit_dist(a, b, normalized=True) for a, b in word_pairs]
>>> cogs = [1, 1, 0, 0] # these are our cognates for all word pairs
>>> npoint_ap(dists, cogs)
1.0




