If you REALLY want to go down the rabbit hole... ; )
.
.
.
I wrote a program where I can paste in Tagalog text and it will analyze the text.
It separates English words from Filipino words by checking each word against a 24,000-word English database, which gives cleaner data. Then, after it spits out the words it doesn't find in our database, I manually take out any English words the program missed to make the data exact.
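A minimal sketch of that English vs. Filipino split, in case it helps to see it. This is not the actual program: the names `tokenize` and `classify`, the tiny word set, and the tokenizing rule (lowercase, letters only) are all assumptions for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase everything and keep only runs of letters.
    return re.findall(r"[^\W\d_]+", text.lower())

def classify(text, english_words):
    # Anything found in the English word set counts as English;
    # everything else is treated as Filipino (to be hand-checked later).
    tokens = tokenize(text)
    english = Counter(t for t in tokens if t in english_words)
    filipino = Counter(t for t in tokens if t not in english_words)
    return filipino, english

# Hypothetical stand-in for the 24,000-word English database.
english_words = {"the", "budget", "senate"}
filipino, english = classify("Inaprubahan ng Senate ang budget, ang budget!", english_words)

print("Filipino # Instances:", sum(filipino.values()))
print("English # Instances:", sum(english.values()))
print("# Unique Filipino Words:", len(filipino))
print("# Unique English Words:", len(english))
```

Any English word missing from the word set lands in the Filipino bucket, which is why the manual cleanup pass is still needed.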
This is an example of the output using 5 news articles from Balita.net:
Overview:
Total # Word Instances: 2841 ~ 9.47 written pages
Filipino # Instances: 1843
English # Instances: 998
# Unique Filipino Words: 666
# Unique English Words: 406
# Unique Words Including English: 1072
Looking only at Filipino words, our dictionary did not find exact matches for:
2.5% ( 46 / 1843 )
...of all instances of all Filipino words in the sample text ~ 4.86 words per page.
Looking at Filipino words, our competitors' dictionaries did not find exact matches for:
7.87% ( 145 / 1843 )
...of all instances of all Filipino words in the sample text ~ 15.3 words per page.
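For anyone checking the arithmetic, the percentages and per-page figures above can be reproduced like this. It's a sketch: the 300 words per page is an assumption inferred from 2841 words ~ 9.47 pages, and `miss_stats` is a made-up helper name.

```python
WORDS_PER_PAGE = 300  # assumption: inferred from 2841 words ~ 9.47 pages

def miss_stats(missing, filipino_instances, total_instances):
    # Percent of Filipino word instances with no exact match,
    # and how many such misses a reader hits per written page.
    pages = total_instances / WORDS_PER_PAGE
    percent = round(100 * missing / filipino_instances, 2)
    per_page = round(missing / pages, 2)
    return percent, per_page

ours = miss_stats(46, 1843, 2841)
theirs = miss_stats(145, 1843, 2841)
print(ours, theirs)
```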
List of words where there is not an exact match in our dictionary:
buhaybilanggo: 4 Corpus Instance Count: 0.
pinaligtas: 2 Corpus Instance Count: 7.
nagpabagsak: 1 Corpus Instance Count: 32.
pagsasakripisyo: 1 Corpus Instance Count: 40.
pagsasabatas: 1 Corpus Instance Count: 79.
pinagbotohan: 1 Corpus Instance Count: 0.
maisumite: 1 Corpus Instance Count: 26.
pagbotohan: 1 Corpus Instance Count: 32.
nairaraos: 1 Corpus Instance Count: 0.
manggawa: 1 Corpus Instance Count: 0.
nagpaluhod: 1 Corpus Instance Count: 0.
pangangalanan: 1 Corpus Instance Count: 30.
makapaglakbay: 1 Corpus Instance Count: 3.
makaadapt: 1 Corpus Instance Count: 4.
maitlis: 1 Corpus Instance Count: 0.
pagkukumpara: 1 Corpus Instance Count: 48.
pagdepende: 1 Corpus Instance Count: 23.
nadaragdagan: 1 Corpus Instance Count: 17.
kabalitaran: 1 Corpus Instance Count: 0.
nakawork: 1 Corpus Instance Count: 15.
kalalawigan: 1 Corpus Instance Count: 4.
magpasamantalang: 1 Corpus Instance Count: 0.
nagoverprice: 1 Corpus Instance Count: 0.
paglusog: 1 Corpus Instance Count: 5.
pagpapahiram: 1 Corpus Instance Count: 3.
paguwiuwi: 1 Corpus Instance Count: 0.
sumakabilang: 1 Corpus Instance Count: 38.
pilantropo: 1 Corpus Instance Count: 10.
nakakuwentuhan: 1 Corpus Instance Count: 37.
kanangkamay: 1 Corpus Instance Count: 6.
pananagasa: 1 Corpus Instance Count: 7.
iquarantine: 1 Corpus Instance Count: 0.
nakitil: 1 Corpus Instance Count: 7.
makapagrecord: 1 Corpus Instance Count: 0.
lukaok: 1 Corpus Instance Count: 0.
gurami: 1 Corpus Instance Count: 4.
tagpasin: 1 Corpus Instance Count: 4.
pagtataboy: 1 Corpus Instance Count: 16.
namiminsala: 1 Corpus Instance Count: 3.
magbuhaybilanggo: 1 Corpus Instance Count: 0.
nakapangingilabot: 1 Corpus Instance Count: 7.
lumubo: 1 Corpus Instance Count: 0.
You'll notice the missing words list also shows each word's frequency count from our 24-million-word corpus. That's to indicate which of the missing words are important and should be prioritized for data entry.
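As a rough sketch of that prioritization (not the real tool), the list lines can be parsed and sorted so the words with the highest corpus counts come first. The regex and the `prioritize` name are assumptions based only on the output format shown above.

```python
import re

# Matches lines like "pinaligtas: 2 Corpus Instance Count: 7."
LINE = re.compile(r"^(\w+): (\d+) Corpus Instance Count: (\d+)\.$")

def prioritize(lines):
    rows = []
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            rows.append((m.group(1), int(m.group(2)), int(m.group(3))))
    # Highest corpus count first, then highest sample count, then alphabetical.
    return sorted(rows, key=lambda r: (-r[2], -r[1], r[0]))

sample = [
    "buhaybilanggo: 4 Corpus Instance Count: 0.",
    "pagsasabatas: 1 Corpus Instance Count: 79.",
    "pagkukumpara: 1 Corpus Instance Count: 48.",
]
ranked = prioritize(sample)
for word, in_sample, in_corpus in ranked:
    print(word, in_sample, in_corpus)
```

Sorting on corpus count alone would bury a word like buhaybilanggo, which shows up 4 times in the sample but 0 times in the corpus, so in practice both numbers are worth a look.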
Also, as a side note: this is only counting EXACT MATCHES or EXACT MATCH + NG. A lot of the time a missing word has a very close word or alternate spelling that will still show in our search results, so the real miss rate is actually better than 2.5%. I'm also giving the competitor dictionaries credit for exact matches on the word + ng ligature, even though oftentimes their algorithm won't show that search result unless you search for both manually.
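The matching rule could be sketched roughly like this. It is a simplification: `has_match` and the tiny headword set are hypothetical, and it only strips a trailing "ng", so it doesn't cover every way the ligature attaches.

```python
def has_match(token, headwords):
    # Exact match on the word as written.
    if token in headwords:
        return True
    # EXACT MATCH + NG: the word is a headword with the -ng ligature attached.
    if token.endswith("ng") and token[:-2] in headwords:
        return True
    return False

# Hypothetical headword list standing in for a dictionary.
headwords = {"bata", "maganda"}
checks = {w: has_match(w, headwords) for w in ["bata", "batang", "magandang", "lumubo", "ng"]}
print(checks)
```

Close forms and alternate spellings are deliberately not counted here, which is why the reported miss rate is on the pessimistic side.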
ANYWAY... all of that is to say there's a lot more going on in the background to grow this dictionary efficiently than might be immediately obvious.