Web Intelligence 2010
August 31 - September 3, 2010, Toronto, Canada
Background and Motivation
Users: Need to Find Information
- One information-finding need is locating: we know it exists, need to find where it is.
- Another information-finding need is discovery: we don't know it exists, we need to find out that it does.
Solutions: Search Engines and Web Directories
- A Search Engine is a machine for locating information. It is automated, uses machine learning techniques.
- A Web Directory allows browsing (information discovery). Even state-of-the art directories are maintained manually.
Economic Reality: Web Directories
- A Web Directory is expensive to maintain, so they are all in decline.
Result: Users Have a Problem...
- The information locating need is met by search engines.
- The information discovery part is lacking.
Problems of the Current Solutions
Each solution solves part of the problem
It work towards some information-finding goal and solves part of the problem, but has limitations.
- Web directories
- Provide exploration of information
- Limitation: confusing and cumbersome hierarchy (DMOZ has >700k categories)
- Limitation: expensive to build and maintain due to business model
- Limitation: the current model excludes some types of resources (only homepages are indexed)
- Search engines
- Assist locating of information
- Plus: relatively cheap to build (compared to models relying on manual indexing)
- Problem: huge webmaster economy abuses the model
- Limitation: search is stateless (no continuity, no user research developed over a period)
- Limitation: no personalization (in term of user preferences or user background)
- Limitation: user has no control over ranking settings
- Limitation: search query is not too expressive
- Limitation: user search context and background are unknown
- A gap exists... A solution is needed with:
- Global scope
- Directory structure
- Easy navigation
- Better user control
Proposed solution
We propose: Automated Web Directory for Intelligent Web Exploration
- Uses a spider such as the search engines use, to download billions of documents.
- Uses machine learning to classify them into a hierarchical structure of categories.
- Provides search within a category, with local keyword relevance (user has instant choice between millions of contexts).
- Allows users to provide relevance feedback on results and develop one of them into a complex personal context.
- System allows users to save their research, and to share it with others.
Issues we address
- Information discovery assisted by machine learning.
- Ability of users to select exploration context.
- Ability of users to provide their own exploration context.
- Ability of users to influence search results.
- Ability of users to reuse (re)search sessions.
- Cooperation between users.
- In one word: we give full control to the users. It's their search, after all.
Challenges
What are the main problems we have to face?
- Large volume of data (issues with scalability)
- Web data is very noisy (sources consist predominantly of formatting and cross-links, much spam)
- Constantly changing document collection (the web)
- Constantly changing user perceptions and opinions (gradually shifting classification of resources)
- Human classification is expensive and results are noisy (DMOZ and Yahoo! Directory are full of errors)
- Evaluation of results (classification) is subjective
Implementation
Web spider
- We implemented a guided web spider (URLs approved by a human editor)
- Texts are filtered based on limiting lengths of words and phrases
- For experiments, downloaded documents listed in the Open Directory Project (DMOZ) - editor-approved listings
- In real life, the spider will work and change the collection continuously
Classifier
- We tested Self-Organizing Maps, neural net and Bayesian classifier
- Selected Multinomial Naive Bayes (MNB) as tool of choice: simple, fast, generalizes well
- For every node in the category tree, we have a separate classifier: a hierarchical structure of MNBs
- Every classifier works with part of the document collection and calculates document statistics locally
- As a result, keyword relevance is different in every category
- Added bonus of using MNB: it calculates local TF-IDF weights for words in the course of its work
- Since the spider changes the collection continuously, the classifiers have to work continuously too
- In practical terms though, we do staggered iterations over snapshots of the data collections
Adaptations to algorithm
Problems of Multinomial Naive Bayes
- Standard MNB is static: not suited to continuous classification
- Iterations mean rising counts in word occurence statistics
- Changing classification of an instance leaves learned noise in data
- Algorithm has a huge problem with unbalanced data; our data is very unbalanced in a number of ways
Adaptations we did to Multinomial Naive Bayes
- Time decay of learned word occurence counts - solves the issue with changed classifcation (unlearning)
- Applied l2 normalization and class-specific word count normalization to document statistics
- Applied "train-on-error" policy: classifier trains on borderline cases only (policy used by industrial spam filters)
- Instead of using full class prior distributions over the whole collection, we use stochastic partial distributions over the last 10 000 errors
- The above has the effect of introducing a negative feedback loop which increase accuracy and (more imprtantly) smooths error variation between classes
- As a result, we have a classifier which works acceptably on noisy web data
System use cases
Search mode: similar to search engines, uses keyword-search paradigm
- Uses inverted document index (keyword-to-document) weighted by relative TF-IDF for the term in the document
- When searching within a category, search query terms are weighted based on their local TF-IDF in the category
Exploration mode: our innovation, uses the hierarhical classifier
- Users can browse the category as a static structure (replicates DMOZ, Yahoo! Directory...)
- Users can mark documents as either relevant or not relevant to what they want (system is usable by clicks only)
- Users can (optionally) submit a query which can be a list of words, or a whole document
- The query is weighted using Rocchio relevance feedback weighting: Q = αQu + βQp - γQn
- The relative weights: α (user query), β (positive feedback) and γ (negative feedback) are user-adjustable
- Search results come in two forms: relevant documents and suggested categories
- To suggest categories, the query is treated as a document and is passed through the backend classifier hierarchy
- Suggested categories are both "best fit" to the query and "next best": users can now browse related categories
- Clicking a category link starts the same process again, but within its context only
- Word weights are re-calculated for that category: we get a "floating query"
Usability additions
Saved (re)searches
- Users can spend signifcant efforts refining a query - it should not be lost after the end of the session: they can save it.
- Search sessions can be re-opened later, developed further and saved or forked as new.
- Users can add notes to a search session.
- Users can add their own bookmarked URL to search results.
- Saved search sessions can be made public to encourage collaboration between users.
Future work
Plans for the future
- Resolve scalability issues.
- Introduce collaborative filter to allow users to train the classifiers.
- Make classifiers fuzzy, since many documents belong to more than one category.
- Test user acceptance with the system installed on top of a commercial search engine.
Thank you
Thank you for your time.
Questions?
Contacts
For further questions or comments, please email
.
Web: www.pavka.com.au