The challenge of extracting meaningful information from an ever-growing Internet awash in languages, dialects, and knowledge domains is clearly too much for our brains to handle.
And traditional approaches are simply not up to the task.
However, a combination of statistical methods, data mining and machine learning could help change all that.
Mari-Sanna Paukkeri, a doctoral candidate at the Aalto University Department of Information and Computer Science in Finland, has developed computational methods of text processing that are independent of any language or knowledge domain.
Languages share certain building blocks: symbols form words and words aggregate into sentences. Algorithms developed by Paukkeri analyze massive bodies of text and discover patterns in the presence of words and the structure of sentences. From these patterns, the meaning of specific words and sentences can be inferred.
To date, computational approaches to natural language processing have typically relied on rules defined in advance. Instead, Paukkeri’s algorithms use unsupervised machine learning to uncover meaning from statistical dependencies and structures that exist in the dataset with no help from data pre-processing or human intervention of any kind.
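Paukkeri's actual algorithms are not detailed in the article, but the core idea — learning about words purely from statistical regularities in raw text, with no language-specific rules — can be sketched with a toy example. The code below builds a co-occurrence profile for each word and compares profiles with cosine similarity; all names and the toy corpus are illustrative, not taken from her work.

```python
from collections import Counter, defaultdict
import math

def context_vectors(tokens, window=2):
    """Count, for each token, which tokens appear near it.
    No dictionaries or grammar rules -- pure co-occurrence statistics."""
    vecs = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vecs[w][tokens[j]] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy corpus: "cat" and "dog" occur in similar contexts, so their
# statistical profiles end up more alike than those of "cat" and "ran".
text = "the cat sat on the mat the dog sat on the mat the cat ran"
vecs = context_vectors(text.split())
```

Words that play similar roles acquire similar context profiles, so `cosine(vecs["cat"], vecs["dog"])` exceeds `cosine(vecs["cat"], vecs["ran"])` — meaning emerges from the data alone, which is the essence of the unsupervised approach described above.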
A familiar use of unsupervised machine learning and natural language processing is the ability of Google News to bundle related news stories on any topic the user requests.
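This kind of story bundling can be approximated, very crudely, by grouping headlines whose word sets overlap. The sketch below uses greedy clustering with Jaccard similarity; it is a minimal illustration of the idea, not how Google News actually works, and the threshold and headlines are invented.

```python
def jaccard(a, b):
    """Word-set overlap between two texts (0 = disjoint, 1 = identical)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def bundle(stories, threshold=0.2):
    """Greedily assign each story to the first cluster it resembles,
    or start a new cluster if none is similar enough."""
    clusters = []
    for s in stories:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

headlines = [
    "election results announced in finland",
    "finland election results are in",
    "new smartphone released today",
]
clusters = bundle(headlines)
```

The two election headlines share most of their words and land in one cluster, while the smartphone headline starts its own — no topic labels or language rules required.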
As such, Paukkeri’s methods have the potential to serve global corporations particularly well because they can glean meaningful insights from vast storehouses of data across multiple languages and knowledge domains.
Paukkeri has even studied how a search engine could ascertain whether the user is an expert or a layperson and return suitable results by automatically assessing how difficult the texts it finds are to comprehend.
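The article does not say how that difficulty assessment works, but surface statistics alone already give a usable signal. The sketch below scores a text by average word length and the rate of words outside a small common-word list; the scoring formula and word list are assumptions for illustration only.

```python
def difficulty(text, common_words):
    """Crude readability score: longer and rarer words raise the score."""
    words = text.lower().split()
    avg_len = sum(len(w) for w in words) / len(words)
    rare_rate = sum(w not in common_words for w in words) / len(words)
    return avg_len * (1 + rare_rate)

# Tiny illustrative vocabulary of "easy" words.
COMMON = {"the", "a", "is", "of", "to", "and", "in",
          "heart", "pumps", "blood", "body"}

expert_text = "myocardial infarction necessitates immediate revascularization"
lay_text = "the heart pumps blood to the body"
```

The expert text scores far higher than the lay text, so a search engine could, in principle, rank results to match the user's apparent expertise.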
For more, see the original article here.