Streamlining web mining
Last Sunday I submitted a comment to the people vs. machine debate in Research Magazine. Some readers of that comment asked me how I reach 97% accuracy when measuring sentiment changes in web mining.
Web text analytics is a rather new field of research, and everybody uses their own approach. So my only advice is: don’t be too quick. If you collect millions of records and focus on thousands of specific sentiment-rich expressions, look at the data first. Compute some basic descriptive statistics (yes!), draw some charts of the frequency distributions, and so on. Try to find a proper way to stratify the data, using your best proven approaches and tools. Don’t skip this basic examination. I write this because I see many newcomers to the analytics business who want to cut corners.
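To make the "look at the data first" advice concrete, here is a minimal sketch in Python. It is only an illustration, not the workflow I actually use: the toy records, the stratum labels, the tracked expressions and the naive tokenizer are all assumptions made up for the example.

```python
from collections import Counter
import statistics

# Hypothetical toy records as (stratum, text) pairs; a real corpus would be far larger.
records = [
    ("forum", "great phone but awful battery"),
    ("forum", "awful service, awful support"),
    ("news", "great results this quarter"),
    ("blog", "great camera, great screen"),
]

# Hypothetical sentiment-rich expressions to track (single words, for simplicity).
expressions = {"great", "awful"}


def tokenize(text):
    """Lowercase, drop commas and split on whitespace (deliberately naive)."""
    return text.lower().replace(",", " ").split()


def expression_counts(records, expressions):
    """Count how often each tracked expression occurs in each stratum."""
    counts = {}
    for stratum, text in records:
        bucket = counts.setdefault(stratum, Counter())
        bucket.update(tok for tok in tokenize(text) if tok in expressions)
    return counts


counts = expression_counts(records, expressions)

# Overall frequency distribution across all strata.
totals = Counter()
for bucket in counts.values():
    totals.update(bucket)

# One basic descriptive statistic: mean record length in tokens.
lengths = [len(tokenize(text)) for _, text in records]
mean_length = statistics.mean(lengths)

# A crude text "chart" of the overall distribution, then the per-stratum counts.
for expr, n in totals.most_common():
    print(f"{expr:<8} {'#' * n} ({n})")
for stratum in sorted(counts):
    print(stratum, dict(counts[stratum]))
print("mean record length:", mean_length)
```

Even a table this crude shows whether an expression is spread evenly across strata or piles up in one of them, which is exactly what you want to know before going further.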
If you find a good way to stratify the data, you will undoubtedly notice that some expressions occur most frequently in one, two or three specific contexts or subject domains. Follow this clue and limit further research to these expressions. This is the first step toward discourse mining (not simply text mining).
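One possible way to turn this clue into a filter is to measure, for each expression, what share of its occurrences falls into its few most frequent domains. The sketch below does that; the counts, the domain names, and the `top_k` and `min_share` cut-offs are hypothetical values chosen for illustration and would need tuning on real data.

```python
from collections import Counter


def concentration(domain_counts, top_k=3):
    """Share of an expression's occurrences that fall in its top_k domains."""
    total = sum(domain_counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in Counter(domain_counts).most_common(top_k))
    return top / total


def select_expressions(counts_by_expression, top_k=3, min_share=0.8):
    """Keep expressions concentrated in at most top_k domains (assumed threshold)."""
    return [
        expr
        for expr, domain_counts in counts_by_expression.items()
        if concentration(domain_counts, top_k) >= min_share
    ]


# Hypothetical occurrence counts per subject domain.
counts_by_expression = {
    "overpriced": {"retail": 40, "travel": 35, "telecom": 20, "banking": 3, "health": 2},
    "good": {"retail": 20, "travel": 20, "telecom": 20, "banking": 20, "health": 20},
}

selected = select_expressions(counts_by_expression)
print(selected)
```

In this toy data "overpriced" is kept because almost all of its occurrences sit in three domains, while "good" is dropped because it is spread evenly and tells you little about any particular discourse.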
The next steps are obvious. Look for relations between the various characteristics of the contexts, the subject domains, and these “good” expressions. Use clustering to identify the subject domains and texts you need, then select them from your corpus of texts.
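As a rough illustration of the clustering step, the sketch below groups subject domains by how similar their expression profiles are and then pulls the matching texts from the corpus. It uses a simple greedy grouping on cosine similarity as a stand-in for whatever clustering method you prefer; the profiles, the similarity threshold and the tiny corpus are all assumptions made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse count profiles stored as dicts."""
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_domains(profiles, threshold=0.8):
    """Greedy grouping: a domain joins the first cluster that already contains
    a member at least `threshold` similar to it, else it starts a new cluster."""
    clusters = []
    for name in sorted(profiles):
        for cluster in clusters:
            if any(cosine(profiles[name], profiles[m]) >= threshold for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical expression counts per subject domain.
profiles = {
    "retail": {"overpriced": 40, "rude": 5},
    "travel": {"overpriced": 35, "rude": 8},
    "banking": {"hidden fees": 30, "rude": 4},
    "telecom": {"hidden fees": 25, "overpriced": 3},
}

clusters = cluster_domains(profiles)

# Hypothetical corpus of (domain, text) pairs; keep only the texts whose
# domain falls in the cluster that contains "retail".
corpus = [
    ("retail", "the shop was overpriced"),
    ("banking", "too many hidden fees"),
    ("travel", "overpriced flights again"),
]
wanted = next(set(c) for c in clusters if "retail" in c)
selection = [text for domain, text in corpus if domain in wanted]
print(clusters)
print(selection)
```

The greedy grouping depends on the order in which domains are visited, so for serious work a proper clustering algorithm is the better choice. The point here is only the shape of the step: profile the domains, group them, then cut the corpus down to the groups you care about.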
There are many tools for extracting rich and accurate information from data selected in this way.
Limiting the scope of study is the first and most basic way to streamline any research process. It is also a basic step used in industrial engineering to streamline any manufacturing or business process.