jkos Do you have any opinion on the TenTen corpus in Sketch Engine?
@BoraMac
Yes, actually! I created the TDC corpus after being a little frustrated working with the TenTen corpus.
So...their corpus is about 150 million words vs. our 24 million, so it's roughly 6x bigger, which is a plus.
But they have a weird thing where a HUGE amount of their corpus comes from erotica. ; ) Other than noticing this from just reading samples from the corpus, if you look at their source URLs (which they do publish), you'll see a ton of it is from erotic short stories on the web. Another example as proof: Our corpus has 91 instances of "libog" = "horny", or about 4 instances per million words. Their corpus has roughly 15,000 instances of "libog", or about 100 instances per million words, which is roughly 25x our rate...wow! ; ) So that creates a weird bias in their data.
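If you want to check that kind of comparison yourself, here's a quick sketch of the per-million normalization. It only uses the rounded counts and corpus sizes quoted in this post, and the function name `per_million` is just something I made up for illustration.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw hit count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


# Rounded figures quoted in this thread (approximate, not exact corpus stats).
tdc_rate = per_million(91, 24_000_000)
tenten_rate = per_million(15_000, 150_000_000)

print(f"TDC:    {tdc_rate:.1f} per million")
print(f"TenTen: {tenten_rate:.1f} per million")
print(f"Ratio:  {tenten_rate / tdc_rate:.1f}x")
```

Normalizing like this is what lets you compare corpora of different sizes; the raw counts alone would overstate the gap because their corpus is so much bigger.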
For the record, any corpus is going to have bias...our TDC corpus has a strong bias toward news articles and internet comments (casual text).
Some of the tools are cool, but most are not so useful. A lot of their language analysis tools get mucked up because of Tagalog's ligature system...they don't understand that "bata" and "batang" (just "bata" plus the -ng linker) are the same word, for example.
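To show what I mean by treating "bata" and "batang" as the same word, here's a rough sketch of linker stripping before counting. This is NOT how Sketch Engine or the TDC tools work internally; `strip_linker` and the tiny `LEXICON` are hypothetical, and it only handles the attached -ng / -g linkers (not the separate "na"). It leans on a word list because plenty of Tagalog words naturally end in "ng".

```python
from collections import Counter

VOWELS = set("aeiou")

# Hypothetical toy lexicon of base forms; a real one would be much larger.
LEXICON = {"bata", "hangin", "ilong", "ganda"}


def strip_linker(token: str, lexicon: set[str]) -> str:
    """Return the base form if the token looks like base + linker, else the token."""
    word = token.lower()
    # Vowel-final bases take -ng: bata -> batang
    if word.endswith("ng") and word[:-2] in lexicon and word[-3:-2] in VOWELS:
        return word[:-2]
    # n-final bases take -g: hangin -> hanging
    if word.endswith("g") and word[:-1].endswith("n") and word[:-1] in lexicon:
        return word[:-1]
    # Unknown or already-bare form: leave it alone (e.g. "ilong" stays "ilong")
    return word


tokens = ["bata", "batang", "Batang", "hanging", "ilong"]
counts = Counter(strip_linker(t, LEXICON) for t in tokens)
print(counts)
```

The lexicon check is the design choice that matters here: blindly chopping off "ng" would turn "ilong" into "ilo", so the sketch only strips when the remainder is a known base form.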
It's definitely a decent added resource...but I think it's something like $7/mo? So maybe not necessary if you just want to use the TDC corpus...unless you really have a need for a ton more data to work with, or are really going deep into some linguistic analysis.