From Corpora To Matching
Making effective use of the Internet is increasingly about creating better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:
01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Compute the cosine of the angle between the query vector and each document vector;
10) Find the sought document column, i.e. the one with the largest cosine;
11) Output results to the user in some way.
Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.
Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, not standard words.
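The cleaning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `STOP_WORDS` set is a tiny hypothetical stand-in for a real list of standard words, the names `TextExtractor` and `clean_page` are invented here, and the sketch does not treat script or style blocks specially.

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "is", "in", "of", "the", "to"}


class TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_page(html):
    """Return the lower-cased words of a page, minus tags and stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


page = "<html><body><h1>The Wandle</h1><p>Mills in the valley.</p></body></html>"
print(clean_page(page))  # ['wandle', 'mills', 'valley']
```

Stemming, mentioned later in the article, would be a further pass over the returned word list.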
Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document, not generic and easy to find in any corpus/document. The idea is to find a signature per document.
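One simple way to approximate "characteristic, not generic" is to drop any term that appears in too many documents. The sketch below assumes this document-frequency filter; the article does not say which selection rule it has in mind, and the function name `select_terms` and the `max_fraction` threshold are illustrative choices.

```python
def select_terms(documents, max_fraction=0.5):
    """Keep terms that occur in at most max_fraction of the documents."""
    doc_count = len(documents)
    doc_freq = {}
    for words in documents:
        # Count each term once per document, however often it repeats.
        for term in set(words):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, n in doc_freq.items() if n / doc_count <= max_fraction)


docs = [
    ["river", "mill", "history"],
    ["river", "search", "engine"],
    ["river", "matrix", "vector"],
    ["river", "mill", "museum"],
]
# "river" occurs in every document, so it is too generic to be a signature.
print(select_terms(docs))
# ['engine', 'history', 'matrix', 'mill', 'museum', 'search', 'vector']
```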
Build term-by-document matrix:
The search space is defined by N dimensions, one per chosen term, and each document is a point in this N-term space whose coordinates are given by its terms/features. This allows conceptual/semantic searches.
Each document becomes a column vector and each row represents a term, so a row records the frequency of one term across the analysed corpus. At first we simply build the matrix by counting the terms for each document.
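A minimal sketch of the counting step, assuming each document is already a list of cleaned words and the terms of interest have been chosen. The function name and the sample data are illustrative.

```python
def build_term_document_matrix(documents, terms):
    """Return a matrix with one row per term and one column per document."""
    return [[words.count(term) for words in documents] for term in terms]


terms = ["mill", "river", "valley"]
docs = [
    ["river", "mill", "river"],
    ["valley", "river"],
    ["mill", "mill", "valley"],
]
matrix = build_term_document_matrix(docs, terms)
for term, row in zip(terms, matrix):
    print(term, row)
# mill [1, 0, 2]
# river [2, 1, 0]
# valley [0, 1, 1]
```

Reading down a column gives one document's term counts; reading along a row gives one term's counts across the corpus.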
Compress the matrix:
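There are two basic techniques/methods: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays: the non-zero values, their column (or row) indices, and pointers marking where each row (or column) starts.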
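The row-wise variant can be sketched as follows. The three arrays are the non-zero values, the column index of each value, and a pointer array marking where each row starts. The function names `to_crs` and `crs_get` are illustrative, not taken from any library.

```python
def to_crs(matrix):
    """Compress a dense matrix row by row into three arrays."""
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_index.append(j)
        # Each pointer records where the next row starts in `values`.
        row_ptr.append(len(values))
    return values, col_index, row_ptr


def crs_get(crs, i, j):
    """Look up entry (i, j) in the compressed form; absent entries are zero."""
    values, col_index, row_ptr = crs
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_index[k] == j:
            return values[k]
    return 0


dense = [[1, 0, 2], [2, 1, 0], [0, 1, 1]]
crs = to_crs(dense)
print(crs)  # ([1, 2, 2, 1, 1, 1], [0, 2, 0, 1, 1, 2], [0, 2, 4, 6])
```

The column-wise variant is the same procedure applied to the transposed matrix. The saving matters for real corpora, where most terms are absent from most documents.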
Normalise the matrix:
Normalisation implies transforming column vectors to unit vectors, i.e. vectors of unit length.
Unit document vectors contain the relative frequency of terms; the normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
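As a small illustration, the sketch below divides each column by its Euclidean length. The choice to leave an all-zero column (an empty document) unchanged is an assumption made here to avoid dividing by zero.

```python
import math


def normalise_columns(matrix):
    """Scale every column of a dense matrix to unit Euclidean length."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        length = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        if length == 0:
            continue  # assumption: an empty document stays a zero vector
        for i in range(n_rows):
            result[i][j] = matrix[i][j] / length
    return result


counts = [[3, 0], [4, 2]]
print(normalise_columns(counts))  # [[0.6, 0.0], [0.8, 1.0]]
```

After this step a long document and a short one with the same mix of terms map to the same vector.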
Singular Value Decomposition:
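This factors the term-by-document matrix A, which is rectangular in general, into three matrices: A = U S V^T. The two outer matrices, U and V, hold the left and right singular vectors (the eigenvectors of A A^T and A^T A respectively): the new dimensions. They coincide only in special cases such as a symmetric positive semi-definite matrix. The third matrix, S, is diagonal and holds the singular values, the square roots of the eigenvalues of A^T A, which measure the spread of the corpus along these new dimensions.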
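To make the decomposition concrete, the sketch below recovers only its leading part: the largest diagonal entry together with the matching term-space and document-space vectors. It uses power iteration, a technique swapped in here to keep the example dependency-free. A full decomposition would normally come from a library routine such as `numpy.linalg.svd`. The sketch assumes a non-zero matrix of non-negative counts, so the all-ones starting vector is not orthogonal to the answer.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a dense matrix by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its two singular vectors."""
    a_t = transpose(matrix)
    v = [1.0] * len(matrix[0])
    for _ in range(iterations):
        # One power-iteration step on A^T A, followed by renormalisation.
        w = mat_vec(a_t, mat_vec(matrix, v))
        length = norm(w)
        v = [x / length for x in w]
    av = mat_vec(matrix, v)
    sigma = norm(av)
    u = [x / sigma for x in av]
    return u, sigma, v


# Three terms (rows) by two documents (columns).
a = [[3.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
u, sigma, v = leading_singular_triplet(a)
print(round(sigma, 6))  # 3.0
```

Here `v` points along the dominant direction in document space and `u` along the matching direction in term space, while `sigma` measures the spread along that direction.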
A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised to give the relative frequency of terms in a document.
The term-by-document matrix is then decomposed to calculate its singular values and singular vectors (the eigenvalues and eigenvectors of the matrix multiplied by its own transpose). The singular vectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which documents mainly lie. The singular values quantify the spread of documents along these new axes.
Queries:
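Queries must be based on defined features/terms within the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector against the term-by-document matrix, i.e. matching a query vector q against the documents (columns) of the matrix. When the query and the columns have unit length, each product is the cosine of the angle between the query and a document, and the largest cosine marks the best match.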
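A minimal sketch of the matching step, assuming a small dense matrix and a query expressed as counts over the same terms as the rows. The lengths are divided out explicitly, so the inputs need not be normalised beforehand. The function names and the sample data are illustrative.

```python
import math


def cosine_scores(matrix, query):
    """Return the cosine between the query and every document column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    q_len = math.sqrt(sum(x * x for x in query))
    scores = []
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        d_len = math.sqrt(sum(x * x for x in column))
        dot = sum(q * d for q, d in zip(query, column))
        # A zero-length query or document matches nothing.
        scores.append(dot / (q_len * d_len) if q_len and d_len else 0.0)
    return scores


def best_match(matrix, query):
    """Return the index of the document column closest to the query."""
    scores = cosine_scores(matrix, query)
    return max(range(len(scores)), key=scores.__getitem__)


# Rows are the terms mill, river, valley; columns are three documents.
matrix = [[1, 0, 2], [2, 1, 0], [0, 1, 1]]
query = [1, 0, 1]  # the user asked about "mill" and "valley"
print([round(s, 3) for s in cosine_scores(matrix, query)])  # [0.316, 0.5, 0.949]
print(best_match(matrix, query))  # 2
```

Sorting the documents by score rather than taking only the maximum gives a ranked result list.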
© I am the website administrator of the Wandle Industrial Museum (http://www.wandle.org). The museum was established in 1983 by local people determined to ensure that the history of the valley was no longer neglected, and to raise awareness of its heritage for the use and benefit of the community.
Copyright © 2006 | All Rights Reserved.