Note: Descriptions are shown in the official language in which they were submitted.
CA 02443036 2003-09-14
14/09/03 Yaron Mayer 2/50
Background of the invention
Field of the invention:
The present invention relates to improved searching on the Internet or similar
networks and especially Meta News and/or improved automatically generated
newspapers, and more specifically to a system and method for improved
automatic
collection and displaying of news items on the Internet.
Background
The Internet makes it possible for users to access vast amounts of
information, thus
becoming effectively the world's largest library and the world's largest
database. This
opens up fascinating new possibilities, such as for example automatically
accessing a
huge amount of news sources in order to present to the user for example an
automatically edited "news paper", which automatically selects the most
important
events or news items according to various criteria. However, one of the
biggest
problems is integrating efficiently vast amounts of information and analyzing
it.
Google has recently made available at http://news.google.com an automated
"newspaper", which searches continuously about 4,500 news sources, and lets
users
view automatically generated headlines in one of a few general areas (which
are
currently: Top Stories, World, US, Business, Sci/Tech, Sports, Entertainment
and
Health), or one newspaper divided into the above sections, or lets users search
for news
by keywords. In addition, users can choose between a number of possible
countries
(which are currently: Australia, Canada, France, Deutschland, India, Italia,
New
Zealand, U.K., US), and thus news items can change according to the chosen
country.
The automatic determination of which news items or news stories are most
important
is done by 3 main criteria: In how many sources the news item appeared, how
important are the news sources in which it appeared, and how close is it to
the top in
each of these news sources.
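The three criteria above can be illustrated with a minimal sketch. The actual formula used by Google News is not disclosed in this text, so the weighting below (source weight divided by position, summed over sources) is purely an assumption, and the names `Appearance` and `story_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Appearance:
    source: str           # name of the news source
    source_weight: float  # assumed importance of the source
    position: int         # 1 = top of that source's front page


def story_score(appearances):
    """Combine the three criteria into one number (assumed formula).

    More sources add more terms, important sources add larger terms,
    and positions closer to the top are discounted less.
    """
    return sum(a.source_weight / a.position for a in appearances)


if __name__ == "__main__":
    story = [Appearance("Source A", 1.0, 1), Appearance("Source B", 0.5, 2)]
    print(story_score(story))
```

Any monotonic combination of the three factors would serve the same illustrative purpose; the reciprocal of the position is chosen here only for simplicity.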
However, many problems still remain, such as for example:
1. The current system chooses for each headline just one of the possible
sources (including the first sentence in that news item) and also a photo
from one of the possible sources (typically from another source), and
typically indicates below in smaller print a few additional related headline
links, and then a few
additional names of news sources below, which also link to related items, and
then
there is a final link to typically a few hundreds of additional related links.
This
leads to the following problems:
a. The choice of a single main news source and a single image for each item
seems arbitrary to the user and leads him to prefer this source for reading
the full news item, since he has much less information about the other links.
b. Similarly, the choice of the additional smaller links below also seems
arbitrary to the user.
c. Due to space limitations the clustering possibilities in the first page are
limited, so if for example there is room for only 2-4 main news items in
each category, then very broad, loosely related items might be presented as a
single news item.
d. If the user clicks on the final "related items" link, he typically gets
hundreds or even more than a thousand links to related news items (with the
headline, source, time, and the first 2 lines), sorted either by relevance or
by time. However, the new list is now without any images and without any
clustering, so that many times news stories that are about the same event or
even identical (for example due to two or more news sources using exactly
the same item from a news agency), may appear at different positions in the
list of related links, and various other news items which are more different
appear between them and are typically also dispersed in various places.
This makes it very hard for the user to take advantage efficiently of the list
of related items. (Although clicking on the next 30 links each time may
eventually show for example only 25-30% actual links due to
removing some very similar entries, like Google does also with normal web
pages results, this still leaves the shown items un-clustered, as explained
above).
2. Allowing the user to choose between a few top categories is very limited by
nature
and does not even come close to the true potential of such systems. On the
other
hand, when searching by keywords, the user immediately reaches a list of
results
that is similar to the list that he reaches when clicking on the final list of
"related
items", as explained above, and thus is subject to the same limitations.
Although
many times this first list shows for some of the items, especially in the
beginning,
a few additional sub-items and a link that says "and more", clicking on the
"and
more" links always apparently generates only a completely linear and non-
clustered list again, like in the case of clicking on the "related items"
links in the
automatic newspaper front page, as explained above. For example, searching for
the word "Israel" in Google News shows that there are 12,600 items, and the
2nd
result has the headline "Israel Wants to Exile Arafat -- But Not Yet", with a
few
additional smaller links and the "and more" link. But clicking on the "and
more"
list brings up a linear list that says that there are 1,010 items, and now
there is no
clustering at all (except for deleting entries as explained above). Also,
sorting by
date always seems to create only a linear list with no clustering at all, even
when it
is the first list generated by searching for the keywords.
Thus, it would be highly desirable to have an improved News MetaSearch or
improved automatically generated "Newspaper" which solves the above problems
and
preferably adds also many additional useful features. Other problems with
other types
of searches are also explained and solved below.
Summary of the invention
The present invention tries to solve the above problems in at least one of the
following ways:
1. Preferably instead of one constant headline in each position the user can
click
on something and switch between similar headlines (preferably those that are
automatically generated as most important within the specific news item),
and/or for example the chosen news source changes automatically, preferably
at the same position on the screen (for example changes instantly at the same
position, or for example changes by using effects such as fade-in and fade-out
or scrolling). This automatic switching can be for example between the top 1-
30 automatically chosen top related headlines (preferably showing each time
also the first sentence or more) and when the user clicks anywhere on that
position, he is preferably transferred immediately to the news item that is at
the
position at the time that he clicks on it. Preferably each such headline
(preferably with its first sentence or part of it) is kept long enough for an
average user to read it (for example 30-60 seconds), and preferably even if
this
switching is automatic the user can interfere for example by clicking on the
item or next to it, and thus move the switching for example backwards or
forwards. Another possible variation is for example to allow the user to click
on something near the main item in order to expand the list of switching items
next to each other, preferably without changing the rest of the layout, or for
example to open a menu window which allows to choose any one of them in
the window. Similarly, the image preferably keeps changing (for example in
correspondence with the current source that is in that place in the textual
part,
or independently) preferably automatically for example every few seconds,
thus switching between the sources and letting the user view for example 10-
30 relevant images instead of just one, which makes the whole experience
already more similar to TV. This changing of the image can again be for
example instantly, or for example with fade-in and fade-out, or any other
effects. Another possible variation is to use similar preferably automatic
changes also for example in the smaller links below the main link. Again,
preferably if the user clicks on the image area, he is preferably instantly
transferred to the relevant news item in the relevant news source for the
image
that is visible at that position at the time of clicking. Another possible
variation
is showing for example simultaneously more than one main link and/or more
than one image for that item. Another possible variation is, when available,
showing instead of still images or in addition to them, also streaming video
from these news sources, however in this case the automatic switching of
images is preferably either disabled so that for example the user has to click
on
something in order to view related streaming data from a different source or
other still images, or for example each streaming source preferably remains in
the position for a longer time than still images until switching to the next
streaming source (or for example to the next still image).
2. Preferably if the user clicks on the "additional related items" link or
searches
for keywords, instead of receiving a problematic linear list as explained
above
in the background, he preferably receives a clustered list, so that the
related
links or the keyword search results are preferably again clustered according
to
the similarity of the items, thus enabling preferably recursive clustering,
preferably like a tree (However, since the same news item or sub-cluster might
belong to more than one cluster or sub-cluster, preferably it is shown and/or
can be reached from preferably all the sufficiently relevant clusters or sub-
clusters to which it belongs or is related). Preferably the user can indeed
choose at least between the options of ordering by time & date and ordering by
relevance, but preferably this helps to create order between and/or within the
sub-clusters, but preferably without interfering with the cluster structure
itself.
In other words, even sorting by date preferably does not contradict the
clustering, unless for example the user requests explicitly to sort by date
without any additional sub-clustering. Another possible variation is to allow
for example also a combined sorting, so that for example the items or sub-
clusters are sorted by days or by hours, and for example within each hour
frame or within each day frames they are sorted for example by relevance (for
example within and/or between the sub-clusters). Another possible variation is
to allow the user for example to request to sort the items by the country of
the
source, so that for example the news items are clustered in addition or
instead
also according to the country of the news source, so that for example the user
can see if there are clear differences in the way the same news story is
depicted
in different countries. Instead or in addition, preferably the user can choose
in
this list if he/she wants to see the list with at least one photo near each
item,
when available, (preferably from the same item in the same source), or without
photos. Preferably by clicking on a certain cluster the user can again view a
list
generated for that cluster, preferably again divided into smaller clusters,
however at each stage preferably the user can also simply view specific news
items of the cluster. Another possible variation is to let the user view for
example a graphical or textual hierarchical representation which preferably
shows for example at least one typical headline for each sub-cluster or for
example all of its individual headlines, and preferably shows multiple levels
of
the hierarchy at the same time, or for example the entire hierarchy from the
first general cluster down to the final nodes or down to the lowest sub-
clusters,
so that the user can simultaneously view the multi-level structure of related
types of items and choose directly to focus on the sub-cluster or sub-clusters
that most interest him. Preferably the user can also switch for example
between
a graphic or textual tree mode to the mode of just seeing the clusters at each
stage. This is very important, since, unlike normal web pages, news items
typically refer to specific events, so if for example 500 news items refer to
about 10 different but related events, it is much more meaningful to show
the various sub-clusters than to just sort them for example by relevance or by
the exact time and date, since if for example 50 of them deal with the same
event, it is less meaningful to define which of them is more "relevant". These
improvements can have the following fascinating implications:
a. It means that by searching for interesting keywords or keywords
combinations (for example "homeland security", "rain forests", "science
fiction", or any other subject, common or less common), preferably the
user can instantly view an automatic "newspaper" that deals with the
requested subject (since clustering the first list generated according to
the keywords and requesting an image near each cluster or each item
can cause the list to look like the default initial automatic newspaper
front page). Preferably these images are represented in the MetaNews
system as links to these images in the actual news sources, in order to
save space on the MetaSearch system's own servers. The images can be
displayed on the results page for example in the original size that they
have on the source news page where they appear. Another possible
variation is that for example in order to save bandwidth and/or in order
to keep the size of the images under control for more regularity in the
layout of the results page, preferably the html protocol and/or the html
command set is expanded to allow any image to be requested with a
given size limit, so that preferably if the original image is bigger it is
either truncated automatically to fit in the allowed window, or is for
example automatically downscaled in order to fit completely into the
allowed space (preferably this is done by the user's browser or for
example by the original server). If truncation is used then preferably the
improved html protocol allows the web programmer for example to
specify for each image the x-y coordinates of its central point of interest,
so that the truncation can automatically be around that central point.
Another possible variation is that for example various heuristics are
used by the browser (or by the server) in order to find the central point
of interest automatically, such as for example finding the human face in
the image, starting automatically from the geometrical center, etc.
Another possible variation is that the Metanews system for example
automatically chooses only images that are within a certain reasonable
range of sizes.
b. It means that by using the same or similar rules recursively, the user can
preferably zero-in on a specific type of news item and see in an
organized way for example the same event from different angles. This
can be used for example in order to read about all the implications of a
certain event, and/or for example in order to analyze for example the
types of responses of the world press to certain events. So for example,
a news item about Israel's intent to expel Arafat, which in the prior art
Google News system leads to a large variety of 827 related and partially
related news items, will instead lead to a page which leads to a
hierarchical tree of related types or sub-clusters of items, for example
some dealing with what Israeli leaders say, some about what world
leaders are saying, some about the new Palestinian Cabinet, some
represent views in favor of the expulsion, some against, etc. The clusters
can be for example shown all the way down to the final leaves through
multiple levels of the hierarchy, or for example only for the current
level, which means that preferably simply the same or similar algorithm
that was used for selecting the first page is now applied for example to
the selected group of 827 related items. Preferably the automatic
switching between images and/or between the main items on focus
(which preferably includes at least the 1st sentence or part of it), is also
applied similarly on each displayed page in the recursive sub-clustering.
3. If streaming video is used for example in a few or more of the news sources
that deal with or are related to the same event (i.e. the same cluster or same
sub-cluster), then preferably the user can also request for example an
automatic
formation of a group of these sources on the same screen so that they can be
viewed simultaneously, for example like a split screen in cable TV, except
that
the group is preferably automatically generated dynamically according to the
item of interest and according to current availability. So preferably the user
can
see for example a few or more preferably small streaming media images on the
same screen at the same time and preferably can also for example switch the
sound each time to one of them and/or for example there is a volume control
near each of them. By clicking for example on or near one of them the user is
preferably transferred to that source to view it normally there. Preferably
the
user can switch to the multi-view of the streaming images next to each other
for example by clicking on something near the original preferably
automatically switching image.
4. Preferably as additional new related news items come in, the headlines
and/or
images can be automatically updated even if the user does not click on any
refresh button. For example if there is a report on a new suicide bombing in
Israel, as additional details come in and the same items in the various
sources
become more updated or new items are added, preferably this is also
automatically updated in the automatic news page that the user has in front of
him (for example if the headline or the first sentence have changed or the
images have changed). This is preferably done by automatic partial refresh on
a need basis, as explained already in Canadian application no. 2,432,817 of
Jul.
4, 2003 (and in subsequent continuations of that application in the US and
Canada) by the present inventor, as explained below, and preferably by
grouping identical data packets in groups so that each group contains a single
copy of the identical data packet together with a multiple list of targets, so
that
each group preferably goes to a certain general area, and when it reaches that
general area the data is preferably duplicated back into the individual
packets,
or into smaller groups with less targets, which are later split up into the
individual packets, as explained for example in PCT application PCT/IL
01/01042 of Nov. 8, 2001 and US application 10/375,208 by the present
inventor. Similarly all the data and especially for example any streaming
video
images are preferably distributed this way to the large number of the
automatic
news viewers (for example from the original servers to any mirror sites of the
service and from any original server or mirror site to the users). However,
since, as explained above, headlines and images preferably keep changing
anyway between items of the relevant cluster or sub-clusters, preferably the
user gets a different indication when the items and/or images themselves have
changed (for example the same item has been updated on the news source
where it resides or the image has changed) or new items or images are brought
in, such as for example some sound indication, preferably accompanied with a
visual indication of the new item or the item that has changed, such as for
example some red frame around it, and/or for example the words "Fresh
update" near it, etc. The vocal indication has a further advantage, since the
user
can be alerted for example even if he is currently working on another window.
Of course various combinations of the above and other variations can also be
used.
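The clustered result list of item 2 above, in which sorting by date or relevance orders items between and within clusters without breaking the clusters apart, can be sketched as follows. The similarity measure (word overlap between headlines), the greedy grouping, and all names (`Item`, `cluster`, `sort_clusters`) are assumptions made for illustration only; the text does not prescribe a particular clustering algorithm.

```python
from dataclasses import dataclass


@dataclass
class Item:
    headline: str
    timestamp: int    # assumed unit, e.g. minutes since some reference time
    relevance: float


def words(text):
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two sets of words."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(items, threshold=0.5):
    """Greedy grouping: join the first cluster whose first member is similar."""
    clusters = []
    for item in items:
        for c in clusters:
            if jaccard(words(item.headline), words(c[0].headline)) >= threshold:
                c.append(item)
                break
        else:
            clusters.append([item])
    return clusters


def sort_clusters(clusters, key):
    """Order items inside each cluster, then order the clusters themselves
    by their best item, leaving cluster membership untouched."""
    ordered = [sorted(c, key=key, reverse=True) for c in clusters]
    return sorted(ordered, key=lambda c: key(c[0]), reverse=True)


if __name__ == "__main__":
    items = [
        Item("Israel wants to exile Arafat", 10, 0.9),
        Item("New Palestinian cabinet sworn in", 30, 0.5),
        Item("Israel wants to exile Arafat soon", 20, 0.8),
    ]
    for group in sort_clusters(cluster(items), key=lambda i: i.timestamp):
        print([i.headline for i in group])
```

Applying `cluster` again to the items of one group, with a stricter threshold, would give the recursive sub-clustering described above.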
The detailed embodiments below show in more detail also various
implementation
issues that can help solve various additional problems involved in supplying
the above
features.
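As one illustration of such an implementation issue, the size-limited image request described in 2(a) above (either downscaling the image to fit the allowed space, or truncating it around a central point of interest) can be sketched as simple geometry. The function names and the pixel-coordinate convention are assumptions; no existing html feature is implied.

```python
def downscale_size(w, h, max_w, max_h):
    """Size after shrinking to fit inside max_w x max_h, keeping the
    aspect ratio. Images that already fit are left unchanged."""
    scale = min(1.0, max_w / w, max_h / h)
    return round(w * scale), round(h * scale)


def crop_box(w, h, max_w, max_h, focus=None):
    """Truncation window (left, top, right, bottom) of at most
    max_w x max_h pixels, centered on the point of interest and
    shifted where necessary so that it stays inside the image.

    If no point of interest is given, the geometrical center is used.
    """
    cx, cy = focus if focus is not None else (w // 2, h // 2)
    cw, ch = min(w, max_w), min(h, max_h)
    left = min(max(cx - cw // 2, 0), w - cw)
    top = min(max(cy - ch // 2, 0), h - ch)
    return left, top, left + cw, top + ch


if __name__ == "__main__":
    print(downscale_size(800, 600, 200, 200))
    print(crop_box(800, 600, 200, 200, focus=(750, 50)))
```

A heuristic such as face detection would only change how `focus` is obtained; the window computation stays the same.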
Similar methods, but with the appropriate relevant adjustments, can be used
for
example for creating a more sophisticated shareware meta-search service: For
example
shareware programs should appear in higher places in the meta search results
according to at least one of the following:
a. How many of the included shareware sites list them.
b. In which position they are listed for the given searched keywords.
c. How important the shareware site is (so that for example larger or more
central
major shareware search sites are preferably given at least some higher
weight).
d. How many times they were already downloaded (in each site that gives this
data, except that preferably the data is normalized by the general amount of
listed downloads in that shareware site, for example by comparing it to the other
sharewares that are listed on the same search results page, or by keeping such
data for example in general for each shareware site across multiple searches)
e. The shareware site's rating for the shareware, if available (for example
based
on user votes and/or on their own editorial staff). If based on user votes,
the
rating of that shareware site for the shareware is preferably given higher
weight than an editorial decision in another site, if the number of votes is
given
and is sufficiently large. (This rule is preferably used both between sites
and
across sites, so that if for example the same site shows both editorial rating
and
user votes for the same shareware, then preferably the user votes are
preferred
if a sufficiently large number of users have voted).
If the same shareware appears for example in different versions in various
shareware sites, then preferably the system can for example use also the
rankings of
the previous versions (for example according to one or more of the above
criteria) for
determining the score for that shareware in general, or for example the system
uses in
this case clusters and sub-clusters like in the meta-news, or for example the
system
treats each version independently like any other shareware. Of course, various
combinations of the above and other variations can also be used.
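Criteria (a) to (e) above can be combined in many ways; the following is a minimal sketch under stated assumptions. The additive formula, the `MIN_VOTES` threshold and all names are hypothetical, ratings are assumed to be normalized to the range 0..1, and only the within-site preference of user votes over editorial ratings is modeled.

```python
from dataclasses import dataclass
from typing import Optional

MIN_VOTES = 100  # assumed meaning of "sufficiently large"


@dataclass
class Listing:
    site_weight: float     # (c) importance of the shareware site
    position: int          # (b) rank in that site's results, 1 = first
    downloads: int         # (d) downloads reported for this program
    page_downloads: int    # (d) total downloads on the same results page
    user_rating: Optional[float] = None       # (e) assumed range 0..1
    user_votes: int = 0
    editorial_rating: Optional[float] = None  # (e) assumed range 0..1


def rating(l):
    """Prefer user votes when enough users voted, else the editorial rating."""
    if l.user_rating is not None and (
        l.user_votes >= MIN_VOTES or l.editorial_rating is None
    ):
        return l.user_rating
    return l.editorial_rating if l.editorial_rating is not None else 0.0


def listing_score(l):
    # (d) downloads are normalized by the totals on the same results page
    share = l.downloads / l.page_downloads if l.page_downloads else 0.0
    return l.site_weight * (1.0 / l.position + share + rating(l))


def shareware_score(listings):
    # (a) summing over listings rewards programs listed on more sites
    return sum(listing_score(l) for l in listings)
```

Treating each version independently, as in one of the variations above, simply means calling `shareware_score` once per version.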
In the normal Google web pages search engine there are also a few improvements
that can be made in order to solve various problems as explained below.
Preferably at
least one of the following improvements is done:
a. According to the thorough review of Google technology at
http://pr.efactory.de, the normal Google PageRank algorithm, which takes into
account how many incoming links each page has and how important or
authoritative each linking page is (this is defined by how high is the general
PageRank of the linking page), also takes into account the number of outbound
links for each page, but in a negative way: pages that have more outbound
links lose from their own PageRank score, and incoming links from other
pages are given lower weight the more other links there are on the linking
page. So for example if page A has incoming links from pages X, Y and Z
(from other sites), the PageRank score of A is considerably higher if pages
X,Y,Z each have on average for example 3 outgoing links than if they have on
average for example 10 outgoing links each. However, this has the
consequence of reducing the principle of giving more weight to links from
more important or more authoritative pages, since for example a link from a
directory page in Yahoo or in Open Directory would thus have a lowered value
since each linking page there typically has a large number of outgoing
links.
On the other hand, reducing the value of the link according to the number of
other outgoing links on the linking page does have the advantage that it can
reduce for example the effects of submitting a web page to multiple giant junk
directories just in order to increase the number of links to that page. But on
the
other hand, such giant junk directories might be for example artificially
created
in a way that works around this anyway: For example by automatically
creating a special page for each linked page so that there is only one
outgoing
link on that page. Therefore, preferably the reduction in the weight of a link
according to the number of other links on that page is preferably eliminated
or
significantly reduced. Instead, preferably other algorithms are used in order
to
automatically discover specially designed junk directories and ignore them
or giving them much lower weight. (This can be done for example by
identifying automatically certain recurring patterns in such junk pages, or
for
example by using usage data on the linking page in order to determine the
value of the links, so that if for example the linking page is in some junk
directory that is hardly ever visited, then the link will naturally have a
much
lower weight). On the other hand, the position of the link on the page is
preferably taken into account, so that a link in a higher place in the linking
page is preferably given higher weight, except that preferably the system
automatically notices if the links are sorted alphabetically on that page (for
example if it is a page in a web directory, such as for example Yahoo or
OpenDir), and in that case preferably the position is ignored since a higher
position is merely the result of the linked Web page having a name that
appears
higher in the alphabet. In addition, it does not make sense at all to reduce
the
PageRank of page A just because page A has more outgoing links. On the
contrary, typically the more important a page is, the more outgoing links it
has,
since pages with no outgoing links are typically end nodes that deal with more
limited content. Also, the more important a site is, the more pages it
typically
has, but by reducing the rank due to outgoing links the Google PageRank
algorithm actually punishes web sites for containing more pages. Therefore,
another possible variation is to increase the PageRank in general for sites
that
have more pages and more outgoing links, except that of course incoming links
from independent sites should remain much more important than outgoing
links since otherwise people might add outgoing links just to boost their
rank.
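The effect discussed in (a) can be made concrete with a simplified link-ranking iteration. This is only a sketch and not Google's implementation: the `dilution` exponent is a hypothetical parameter, where 1.0 divides each link's weight by the number of outgoing links on the linking page (the behavior criticized above), 0.0 eliminates that reduction, and values in between reduce it.

```python
def link_rank(links, dilution=1.0, damping=0.85, iterations=50):
    """Iteratively score pages from their incoming links.

    links maps each page to the list of pages it links to.
    Scores are renormalized every round so that they sum to 1
    regardless of the dilution setting.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for src, targets in links.items():
            for t in targets:
                new[t] += damping * rank[src] / (len(targets) ** dilution)
        total = sum(new.values())
        rank = {p: v / total for p, v in new.items()}
    return rank


if __name__ == "__main__":
    # X links only to A; directory page D links to B and three other pages.
    web = {"X": ["A"], "D": ["B", "C1", "C2", "C3"]}
    classic = link_rank(web, dilution=1.0)
    flat = link_rank(web, dilution=0.0)
    print(classic["A"] > classic["B"])   # the directory link counts for less
    print(abs(flat["A"] - flat["B"]) < 1e-12)
```

With the reduction eliminated, the link from the directory page counts as much as the link from the small page, so other means (such as the junk-directory detection mentioned above) are needed to discount artificial links.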
b. Another problem with PageRank is that it automatically gives higher scores
to
older pages simply due to the fact that they have been around long enough to
have gathered more links to them, and, conversely, new pages might take a
long time to get a high listing in Google simply because at the beginning they
have no or too few links to them from other sites. In fact Google have
themselves noticed this problem and tried to solve it in US patent application
20020123988, filed March 2, 2001 and published Sep. 5, 2002, by
incorporating also automatic usage statistics for each page (from various
sources). However, first of all this does not solve the original problem,
since
older pages with more links, which are therefore already listed higher on the
Google directory, will typically also have by definition more visitors than
the
new page even if the new page is indeed more relevant to the search query.
Secondly, simply incorporating usage statistics into the score creates the
danger of a classical "Matthew effect" of the rich getting richer and the poor
getting poorer. In other words, if usage statistics are simply incorporated
mathematically into the final score, then for pages which currently have high
usage (a high number of visitors) for any reason (for example because they
gathered links to them over time and are therefore listed high in the Google
search results, or for example because some new site managed to convince
some journalist to write about it), the increased usage can create a
snowballing effect of higher rank in Google, and therefore more usage, etc.,
and vice versa, good pages which have initially low usage can enter a negative
cycle of decreasing usage and being listed lower. In order to correct this
dangerous problem, preferably usage statistics are used only with one or more
thresholds, so that for example usage lower than a certain factor preferably
does not continue to lower the score, and usage higher than a certain factor
preferably does not continue to increase the score. This improvement is
extremely important since it allows using usage data while using at the same
time a mechanism for preventing it from causing vicious cycles (negative or
positive). Another possible variation is that usage statistics are used only
for
modifying the value of the link in the linking page but not for modifying
directly the ranking of a page. In addition, the problem of how long the page
has existed is preferably solved by taking into account also historical data, so
that preferably for example a page that has existed for example for 3 months
and has already for example 20 valid links to it might have for example a
higher score than a page that has existed for 3 years and has for example 30
valid links to it. So preferably the time factor is taken into account for
determining
the weight given to the number of links. (Of course the same algorithm can be
used whether any valid links are taken into account or for example only links
that seem to be related to the searched keywords are taken into account).
Again, preferably at least some threshold is used, so that 0 links or too few
links are not compensated by the fact that the page is new, but if the new
page
has already sufficient valid links, for example at least 10 links (or any
other
reasonable threshold number) from other sites that preferably do not reside on
the same IP address and their domain is not owned by the same person or
organization, then the newness of the page is preferably taken into account in
requiring fewer links at that stage. From the point of view of older sites this
also
makes sense, since this means that if a page for example has 50 valid links to
it
since it has existed for a number of years but the number of links does not
continue to increase over time then probably the site is really not so
important,
whereas a really important site would continue to gather more links over time,
thus compensating for the fact that more time has passed. However the system
preferably has to use historical data to determine how long a page has
existed,
since it obviously cannot rely for that on any info on the page itself or on
the
site where the page resides. Archives such as for example the Internet
archives
at http://www.archive.org cannot be relied upon since not every page is
indexed there, and also they contain much more data that is not necessary for
this, such as for example the historical content of each page for example in 1-
month jumps or any other temporal jumps. Instead, preferably the system
itself,
for example Google, preferably keeps historical records which can contain for
example just the URL of each page and the time when it started to appear.
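The two mechanisms described in (b), usage statistics bounded by thresholds and a link count adjusted for the age of the page, can be sketched as follows. All numeric thresholds except the example figures taken from the text (10 links, 3 months, 3 years) are assumptions, as are the function names and the linear form of the age adjustment.

```python
def clamped_usage_factor(visits, low=100, high=10_000):
    """Usage only matters between two thresholds (assumed values).

    Below `low` the score is not lowered further and above `high` it is
    not raised further, which prevents a self-reinforcing cycle.
    """
    clamped = min(max(visits, low), high)
    return clamped / high


def age_adjusted_links(valid_links, age_months, min_links=10,
                       reference_months=36):
    """Give newer pages credit for gathering links quickly.

    A page with fewer than `min_links` links gets no compensation for
    being new. Pages at or beyond `reference_months` keep their raw count
    (this sketch does not model penalizing stagnant older pages).
    """
    if valid_links < min_links or age_months >= reference_months:
        return float(valid_links)
    return valid_links * reference_months / max(age_months, 1)


if __name__ == "__main__":
    # Example from the text: 20 links in 3 months vs. 30 links in 3 years.
    print(age_adjusted_links(20, 3) > age_adjusted_links(30, 36))
```

The `age_months` input would come from the search system's own historical records of when each URL first appeared, as proposed above.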
c. In addition, Google typically also uses the anchor text of inbound links to
determine the relevance of the linked page to the searched keywords. For
example, if the user is searching for the keywords "free sex", then instead of
being fooled by numerous not-really-free pages that use these words
extensively to trick search engines into giving them a high rank for these
popular search keywords, Google in fact relies on the assumption that if links
on other independent sites state in the link itself that this is indeed a
free sex page, then the human who made the link probably checked and found
that the linked page is really free. In fact, Google itself
did not
invent this idea, since in the basic Google US patent 6,285,999, originally
filed
in a provisional application on Jan. 10, 1997, and issued on Sept. 4, 2001,
Larry Page indicates that this basic idea was already used before by the
"World
Wide Web Worm" and by "Hyperlink Search Engine", developed by LDD
Information Services. On the other hand, this idea is preferably further
improved to include at least some semantic analysis of the anchor href text
and/or preferably also at least the surrounding nearby text, or at least for
example the immediate text preceding the link. This is important since in the
above example if for example the text of the link or the text preceding the
link
says that the following linked pages are not really free sex pages or are for
example only partially free, and the system only analyzes the fact that both
the word "free" and the word "sex" appeared in the anchor text or near it, then the
system can still be easily misled. So preferably the analysis of the href
text
and/or for example the surrounding near text preferably at least takes into
account some basic language structures such as for example negation words, or
modifying words, such as for example "really", "partially", etc., and thus is
preferably at least able to identify at least part of the meaning and/or avoid
certain pitfalls that are relevant to the interpretation of the real meaning
of the
link.
d. Another possible improvement, which can be used also in other types of
search
engines or metasearch engines, is to include for example in the keywords
search (for example in the general web search or in the news Meta-Search or in
the newsgroups search and/or in other types of search) also synonyms, so that
for example if the user searches for the keywords "deport Arafat" and the
system's synonym database suggests that deport is a close synonym of expel
and the system for example finds that there would be for example more or
much more relevant results if the user had used the keywords "expel Arafat"
instead, then the system can for example automatically include in the
displayed
search results also the pages that contain the keywords "expel Arafat", or for
example the system asks the user if he would like to consider also for example
close synonyms (and preferably remembers that as default for that user for
following searches, for example in a browser cookie file), or for example the
system responds in a way similar to the way that Google responds today if
there is a typing error. So for example if the words "deport Arafat" lead to
for
example 200 relevant pages (for example in the recent news search) but the
words "expel Arafat" lead to for example 470 pages, (or for example any
number larger than the exemplary first 200 or any number larger by a certain
minimal difference or minimal factor), then preferably the results search page
can for example display the results and ask the user at the top "did you mean
expel Arafat?" in this example. In this case, preferably the system also
indicates to the user already with this message how many results instead would
be on the other search. More preferably, the system can ask the user for
example "would you like to include also results with expel Arafat?", and in
this case this message preferably indicates the number of results that would
be
in the combined search results, and then if the user clicks on that link then
both
types of results are preferably integrated, as explained above. In summary,
preferably the system can do at least one of the following: 1. Automatically
include in the search results also pages that contain synonyms or close
synonyms of the requested keywords. 2. Ask the user if he would like to
include in the search results automatically also pages that contain close
synonyms of the requested search keywords and remember that as default for
that user for following searches. 3. Check at least close synonyms of the
user's
search keywords, and if there are more and/or better results with the synonyms
then the system preferably asks the user for example if he wants to switch
over
to the results of the search that was based on the synonyms, and/or asks the
user for example if he wants to integrate the current results with the results
of
the search that was based on the synonyms. This is a most significant
improvement that can help users and significantly enhance the efficiency of
searches, since many times the biggest problem of users is that they don't
know the most appropriate keywords to search for or don't know all the most
relevant ones. Similar principles can be used for example while searching for
patents for example at the USPTO, since many times users can miss relevant
patents for example because they are not searching properly for all the
relevant
keywords.
e. Another possible variation is for example to allow the user to define
various
parameters for scoring the results, preferably on certain allowed ranges, such
as for example the relative weight of usage statistics, the amount of
reduction
of the importance of a link as a result of the total number of links on the
linking page, the amount of taking into consideration the newness of a web
page so that less links to it are required, etc. These values are preferably
remembered for example in a browser cookie, and the system preferably
displays to the user on each search the parameters that are currently
effective.
This can give users an additional important flexibility and control, instead
of
being dependent on sometimes arbitrary decisions by the search engine.
f. In addition, if usage statistics are collected, preferably from the browser
or
from a plug-in in the user's browser, preferably they include additional
information, such as for example the typical link-clicking sequence when a
user enters a site and starts going over its links, the average time the user
spends
on each site altogether or on each page in the site until moving to another
site,
etc. Such a measure is problematic since the user might for example open
additional links in new windows but keep browsing the original page, so
preferably the browser itself (or the plug-in) for example checks if the user
is
still actively moving within the page. This is why it is preferably done by
the
browser or by a browser plug-in, since for example routers on the way can
provide statistics of requested pages for each requesting IP, but cannot know
what really happens on the side of the client. In addition, preferably the
browser or plug-in also requests from the user, preferably during
installation, at
least minimal background data, such as for example at least sex, age and
education, and the user's country is preferably known automatically according
to his IP or his Operating System settings.
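The idea in item b above (requiring fewer inbound links from newer pages, based on the system's own record of when each URL first appeared) can be illustrated with a minimal sketch. The class and function names, the linear ramp, and all numeric values (a full threshold of 50 links, a new-page threshold of 5, a one-year ramp) are assumptions made only for illustration and are not specified in the text.

```python
from datetime import date


class FirstSeenRegistry:
    """Hypothetical store holding only each URL and the date it first appeared."""

    def __init__(self):
        self._first_seen = {}

    def record(self, url, seen_on):
        # Keep the earliest sighting; later crawls never move the date forward.
        known = self._first_seen.get(url)
        if known is None or seen_on < known:
            self._first_seen[url] = seen_on

    def first_seen(self, url):
        # Returns None for URLs that were never recorded.
        return self._first_seen.get(url)


def required_links(first_seen, today, full_threshold=50,
                   new_page_threshold=5, ramp_days=365):
    """Number of valid inbound links required, given the age of the page.

    Assumed policy: the requirement grows linearly from new_page_threshold
    to full_threshold over ramp_days, and then stays at full_threshold.
    """
    age_days = max((today - first_seen).days, 0)
    fraction = min(age_days / ramp_days, 1.0)
    return round(new_page_threshold
                 + (full_threshold - new_page_threshold) * fraction)
```

A page first seen today would then need only the lower number of links, while a page that has existed for years would need the full number, which matches the reasoning above that an important older page should have continued to gather links.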
Of course, various combinations of the above and other variations can also be
used.
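As an illustration of the minimal semantic analysis of anchor text suggested in item c above, the following sketch rejects a link as evidence for a keyword when a negation or weakening word appears shortly before that keyword in the link text or in the text preceding it. The word lists, the window of 3 words, and the function names are assumptions for illustration only; a real system would need much richer language resources.

```python
import re

# Hypothetical, deliberately tiny word lists.
NEGATIONS = {"not", "no", "never", "isn't", "aren't", "hardly"}
WEAKENERS = {"partially", "partly", "somewhat", "almost"}
QUALIFIERS = NEGATIONS | WEAKENERS


def tokenize(text):
    # Lowercase words, keeping apostrophes so that "isn't" stays one token.
    return re.findall(r"[a-z']+", text.lower())


def anchor_supports(keywords, anchor_text, preceding_text="", window=3):
    """Return True if the link text appears to affirm every keyword.

    A keyword counts only if it occurs at least once without a negation
    or weakening word among the `window` words just before it.
    """
    tokens = tokenize(preceding_text) + tokenize(anchor_text)
    for keyword in keywords:
        positions = [i for i, t in enumerate(tokens) if t == keyword.lower()]
        if not positions:
            return False
        if all(set(tokens[max(0, i - window):i]) & QUALIFIERS
               for i in positions):
            return False
    return True
```

For example, the anchor text "free downloads" would support the keyword "free", whereas "downloads" preceded by "these are not really free" would not, even though the same words appear near the link.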
Also, at least some of the above improvements can be used also in various meta-
search engines (in addition of course to News meta search engines), so that
for
example a web meta search engine such as for example Metacrawler can similarly
apply for example the above variations of including synonyms to the collected
search
results of other search engines.
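The synonym variation described in item d above can be sketched as follows. The sketch assumes a hypothetical `search` callable that returns the set of matching page identifiers for a query, and a hypothetical synonym dictionary; the threshold `min_factor` stands for the "minimal factor" mentioned above and its value is arbitrary. The returned counts correspond to the numbers that the "did you mean" message and the combined-results message would display.

```python
from itertools import product


def synonym_variants(query, synonyms):
    """All queries obtained by replacing words with their close synonyms."""
    options = [[word] + sorted(synonyms.get(word.lower(), []))
               for word in query.split()]
    variants = (" ".join(choice) for choice in product(*options))
    return [v for v in variants if v != query]


def suggest(query, synonyms, search, min_factor=1.5):
    """Return the best synonym-based alternative, or None if none qualifies.

    `search` is an assumed callable: query string -> set of page ids.
    """
    base = search(query)
    best = None
    for variant in synonym_variants(query, synonyms):
        hits = search(variant)
        if len(hits) >= min_factor * max(len(base), 1):
            if best is None or len(hits) > best["count"]:
                best = {
                    "query": variant,           # for "did you mean ...?"
                    "count": len(hits),         # results of the other search
                    "combined": len(base | hits),  # results if both are merged
                }
    return best
```

With the numbers of the example above (200 pages for "deport Arafat" and 470 for "expel Arafat"), and assuming for illustration that 50 pages contain both phrases, the suggestion would report 470 results for the alternative query and 620 for the combined search.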
Definitions and clarification
Throughout the patent whenever variations or various solutions are mentioned,
it is also possible to use various combinations of these variations or of
elements in
them, and when combinations are used, it is also possible to use at least some
elements in them separately or in other combinations. These variations can be
in
different embodiments, or different versions of the software, or sometimes
different options available to choose from. In other words: certain features
of the
invention, which are described in the context of separate embodiments, may
also
be provided in combination in a single embodiment. Conversely, various
features
of the invention, which are described in the context of a single embodiment,
may
also be provided separately or in any suitable sub-combination.
Brief description of the drawings
Fig. 1 is an example of the look of a typical Google automatic
"newspaper" front page (prior art).
Fig. 2 is an example of the look of a typical list generated in
http://news.google.com
after clicking on the list of related items of a given item (prior art).
Fig. 3a is an example of a preferable way that the list of related items (or
the list
generated by searching for news by keywords) can look after clustering it
again like
the automatically generated front page.
Fig. 3b is an example of a preferable way that the list of related items or
the list
generated by searching news by keywords can look when showing multilevel sub-
clustering at the same page.
Figs. 4a-b are examples of a preferable way in which the headlines and/or the
image
of each item can scroll automatically between a number of sources.
Fig. 5 is an example of a preferable way in which multiple streaming video
images of
the same event from various Online news sources can appear on the screen side
by
side.
Fig. 6 is an example of a condensed packet for much more efficient
distribution of the
same data to multiple users.
Detailed description of the preferred embodiments
All of the descriptions in this and other sections are intended to be illustrative
examples
and not limiting.
Referring to Fig. 1, I show an example of the look of a typical Google
automatic
"newspaper" front page (prior art). As can be seen, the prior art system
chooses for
each headline just one of the possible sources as the main item (including the
first sentence in that news item) and usually also a photo from one of the possible
sources (typically from another source), and typically shows below it, in smaller
print, a few additional related headline links, and then a few additional names of
news
sources below, which also link to related items, and then there is a final
link to
typically a few hundreds of additional related links.
Referring to Fig. 2, I show an example of the look of a typical list generated
in
http://news.google.com after clicking on the list of related items (prior
art). In this
case the item that was clicked on was the item about the talks about deporting
Arafat.
As can be seen, this generates a linear list with no clustering at all, and
various items
that should clearly be in the same sub-clusters are dispersed in different
places.
Referring to Fig. 3a, I show an example of a preferable way that the list of
related
items (or the list generated by searching for news by keywords) can look after
clustering it again like the automatically generated front page. As can be
seen,
preferably this can be very similar or even identical to the front page in any
of the
general areas, except that there might be for example less sub-clusters and
less photos,
since only some of the individual news items contain photos that can be used,
so for
example sometimes an entire sub-cluster might be without a photo. As explained
above in the patent summary, preferably the user can switch between a mode
that
shows photos to a mode without, and preferably the photos and/or the main news
items and/or the related smaller items below can switch for example
automatically,
for example every 30-60 seconds within the same area on the page and/or the
user can
move backwards and forwards with them. Since this is a recursion, any of the
improvements described for the main page can preferably also be implemented
here,
such as for example all the improvements shown in Figs 4a & 4b. Preferably the
recursive clustering continues for example until there are sufficiently few
items in the
final sub-category or until the items are too different to group further. As
can be seen
in this example, the general items about talks about expelling Arafat are now
preferably divided into reasonable sub-clusters, such as for example the
response of
Arafat's supporters, the US response, talks about killing Arafat instead of
deporting
him, etc. In order to enable the smarter multi-level sub-clustering, first of
all, in
general, the same or similar principles are preferably applied similarly at
all levels,
except that in each step they are preferably applied now to the items of the
previous
cluster or sub-cluster in order to further divide them into additional sub-
clusters.
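The recursive application of the same principles at each level can be illustrated with a minimal sketch. The similarity measure (word-set overlap), the greedy grouping, the threshold that is tightened at each level, and all numeric values are assumptions chosen only to keep the example short; the text above does not prescribe them. The recursion stops when a cluster has too few items or when its items are too different to group further.

```python
def jaccard(a, b):
    """Word-set overlap between two headlines (an assumed similarity measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def group_similar(items, threshold):
    """Greedy grouping: an item joins the first group holding a similar item."""
    groups = []
    for item in items:
        for group in groups:
            if any(jaccard(item, other) >= threshold for other in group):
                group.append(item)
                break
        else:
            groups.append([item])
    return groups


def cluster_tree(items, threshold=0.2, step=0.2, min_size=3):
    """Recursively divide items into sub-clusters, returned as nested lists."""
    if len(items) < min_size or threshold > 1.0:
        return list(items)          # sufficiently few items: stop here
    groups = group_similar(items, threshold)
    if all(len(g) == 1 for g in groups):
        return list(items)          # items too different to group further
    if len(groups) == 1:
        # Nothing was separated at this level; try a stricter threshold.
        return cluster_tree(items, threshold + step, step, min_size)
    return [cluster_tree(g, threshold + step, step, min_size) for g in groups]
```

Each level applies the same grouping function to the items of the previous cluster only, which is the recursion described above.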
In order to improve the clustering ability, preferably at least one or more of
the
following methods are used:
1. Preferably the time each item was published is taken into account,
preferably
with the assumption that the closer the time of publication between them, the
higher the chance that two items are dealing with the same event. Another
possible variation is to analyze also the temporal words or phrases used
within
the item itself (preferably mainly in the headline and/or in the first few
sentences), since if for example some event has occurred 30 minutes ago, then
any news items that are older than that cannot be reporting about the same
event (although they might have mentioned it even before the event for
example in case of a prescheduled event, such as for example a sports event or
press conference or a ceremony, these items will typically be different from
items that describe the event itself after it has already happened). In other
words, the system preferably uses this analysis to decide when the event
occurred, and this time can be used for example to separate between news
items that occurred before this time and items that occurred after this time
and/or to help decide the similarity between items that might be referring to
the
same event. In order to enable this, preferably the system is able to perform
also at least some minimal type of semantic analysis and/or preferably has at
least knowledge of the relevant temporal nouns (such as for example months
names, weekday names, relative terms, such as for example yesterday, today,
tomorrow), and relevant temporal prepositions (such as for example before, after, during, on),
etc. Preferably this includes also various different ways of writing the same
dates or times, such as for example with numbers, with names or with
abbreviated names (for example Sep. 9 instead of September 9, etc).
2. Similarly, preferably the system has at least a knowledge base of
geographic
areas, such as for example at least country names and city names, so that for
example when the same place appears in two different news items, preferably
in the headline and/or for example in the first 1 or 2 sentences, the system
can
give it more weight than ordinary keywords. The headline and the first 1 or 2
sentences are most important, since according to common journalistic rules,
all
the important information of the 5 W's should already be in there (Who, What,
Where, When, and sometimes also Why). Again, preferably this includes also
different ways of writing the same names, if they exist.
3. In addition, preferably the system has a knowledge base of at least the
most
common or most important verbs that typically appear for example in
headlines and/or in the first one or two sentences of news items (or even in
entire news items). (The original verb list can be for example generated
statistically automatically by analyzing a large number of news items, and
then
human experts preferably define the knowledge base at least for these most
common or most important words). Preferably the knowledge base uses for
example semantic trees and/or semantic graphs and/or various rules, so that
for
example the system knows that killing is much more severe than expelling or
deporting, and preferably knows for example that the words "said" or
"accepted" or "opposes" or "demands" refer to transfer of information (and
preferably also the differences between them on various dimensions, such as
for example giving each word a score on the level of negativity, level of
severity, level of urgency, etc.), and that for example words like "expel" or
"kill" refer to physical actions, etc. So for example each verb might be
characterized by scores (for example between 0-10 or any other suitable range,
or at least a binary characterization) on a number of relevant variables or
dimensions, for example:

Present  Past      Physical  Information  Pos/Neg   Reversible  Typically done by        Typically done to
say      said      No        Yes          Undef     Yes         Humans                   Humans/Animals
tell     told      No        Yes          Undef     Yes         Humans                   Humans/Animals
accept   accepted  No        Yes          Pos       Yes         Humans                   Anything
agree    agreed    No        Yes          Pos       Yes         Humans                   Anything
oppose   opposed   No        Yes          Neg       Yes         Humans                   Humans/Rules
expel    expelled  Yes       No           Neg       Yes         Humans                   Humans
deport   deported  Yes       No           Neg       Yes         Humans                   Humans
kill     killed    Yes       No           Very-Neg  No          Humans/Animals           Humans/Animals
murder   murdered  Yes       No           Very-Neg  No          Humans                   Humans/Animals
execute  executed  Yes       No           Very-Neg  No          Humans                   Humans
execute  executed  Yes       No           Undef     Yes         Humans                   Action/Document
die      died      Yes       No           Very-Neg  No          Humans/Animals/Abstract  Self
break    broken    Yes       No           Neg       No          Humans/Animals           Anything

On the other hand, a more hierarchical structure has the advantage that the
words themselves can be divided into various clusters and sub-clusters and for
example inherit various qualities from their parents in the tree (for example
"kill", "murder", "execute" and "die" are all related to ceasing to exist). In
addition or instead preferably the system includes also a thesaurus (which can
be for example based on existing databases and/or learned automatically from
various statistical analyses of a large number of relevant texts). This way
for
example the system can know that killing Arafat is something much more
negative and irreversible compared to expulsion or deporting, or at least
something that is not a synonym of deporting.
4. Another possible variation is to include at least a database of synonyms
for the
comparisons of nouns and/or of verbs, so that the system can know if two
words are different or similar even without "understanding" their meaning.
5. Another possible variation is to supply the system for example in addition
or
instead with a knowledge base of major known political names and
organizations. Preferably all or at least one or more of the above methods are
also used at least for the most important other languages (such as for example
Spanish, German, French, Chinese, and Arabic) preferably with links between
the corresponding words between these languages, so that the clustering can
preferably work OK also across languages. However, this is less important
since typically the users will want to view news items only in one language.
6. Another possible variation is to analyze the similarity between two news
items
not only by counting the number of occurrences of the same keywords
(according to a detailed article at http://pr.efactory.de, Google currently
relies
mainly on counting the occurrence of keywords after deleting the most common
and the most uncommon keywords), but also by the similarity in the occurrence of
word combinations, for example how many identical 2-word combinations or
3-word combinations exist in both items (or for example the same 2
words with any 1 or 2 other words between them), or for example identical
4-word or 5-word combinations, etc. Another possible
variation is that this analysis is preferably done only or mainly on the
headline
and/or on the first 1 or 2 sentences, which should be the most informative, or
the results of the analysis of the headline and/or first 1 or 2 sentences are
given
higher weight than the analysis of the rest of each item, or for example the
importance of each next sentence is decreased according to its position.
Another possible variation is for example to generate for the user also a
summary of the relevant cluster or of the relevant sub-cluster for example by
generating automatically the list of sentences or for example the list of
first or
2nd sentences that appeared most often in the items of the cluster or
of the sub-
cluster, or for example the sentences which have the largest number of sub-
combinations (for example 3 word combinations) that repeat in other items of
the cluster or of the sub-cluster. Another possible variation is to use this
method for example to highlight the most important sentences in a given
article
(for example by highlighting sentences which appeared in whole or in part
more than other sentences also in other items of the cluster or of the sub-
cluster
or for example by deleting the sentences that are not highlighted, however
deleting is less preferable since it can lead to loss of context). However,
since
the user preferably reads the article itself in the relevant news source site,
this
highlighting can be added for example dynamically by a browser plug-in.
7. Another possible variation is to take into account similarity in words even
if
they are not exactly identical, especially for example in the headline, so
that for
example if a name can be spelled in more than 1 way the system will note the
similarity, especially for example if the two names appear in a similar
structure
in two similar headlines.
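The word-combination comparison of method 6 above, with the headline given more weight than the rest of the item, can be sketched as follows. The use of a Jaccard ratio, the averaging over 1-word, 2-word and 3-word combinations, the headline weight of 0.7, and the dictionary layout of a news item are all assumptions for illustration; the other methods in the list (publication time, geographic names, verb knowledge base, synonyms) are not included here.

```python
import re


def ngrams(text, n):
    """Set of consecutive n-word combinations in the text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a, b, sizes=(1, 2, 3)):
    """Average, over the given sizes, of the shared share of n-word combinations."""
    scores = []
    for n in sizes:
        ga, gb = ngrams(a, n), ngrams(b, n)
        union = ga | gb
        scores.append(len(ga & gb) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


def item_similarity(item_a, item_b, headline_weight=0.7):
    """Similarity of two items given as dicts with 'headline' and 'body' keys."""
    headline_score = overlap(item_a["headline"], item_b["headline"])
    body_score = overlap(item_a["body"], item_b["body"])
    return headline_weight * headline_score + (1 - headline_weight) * body_score
```

Two items that share whole phrases in their headlines score noticeably higher than items that only share isolated keywords, which is the point of counting word combinations rather than single words.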
Referring to Fig. 3b, I show an example of a preferable way that the list of
related
items or the list generated by searching news by keywords can look when
showing
multilevel sub-clustering at the same page. As can be seen, this has the
advantage that
the user can preferably see the entire tree structure with multiple levels of
hierarchy
and click directly on any final node (i.e. an individual news item at a
certain news
source), however this has the disadvantage of too much detail for clusters
that might
interest the user less, and altogether it is less visually appealing than the
variation of
Fig. 3a.
Referring to Figs. 4a-b, I show examples of a preferable way in which the
headlines
and/or the image of each item can switch automatically between a number of
sources.
For example, the CBS news image of Arafat shown in Fig. 4a can switch
automatically between for example 3-20 other related images
(preferably
determined automatically according to the number of relevant images
available), so
that for example each image stays for example for 5 or 10 seconds (or any
other
reasonable time) and the switch is for example instant or for example by fade-
in and
fade-out. As explained in the summary, the images or some of them might be for
example also sources of streaming data, in which case preferably an image
which is a
source of streaming data preferably stays longer before switching over to the
next
image. Similarly, the main item, and/or for example the sub-items or sub-
headlines of
the main item or main headline, can also preferably switch automatically
between a
number of items, for example the entire 27 items that exist in this example in
the main
sub-cluster of the larger cluster of 877 related items, or for example only
among the
for example 10 most important or most recent or most relevant of the 27 (or
any other
reasonable number or percent). However, this switch is preferably without
scrolling
effects and can be for example instantly or with some fade-in and out, and
preferably
each such text remains for the time needed to read it comfortably (for example
20-40
seconds). Another possible variation is to allow the user also to manually
switch
between the images and/or between the specific items within the main sub-
cluster
and/or within the sub-clusters represented by the sub-headlines, for example
by
adding the blue arrows for "Prev" and "Next" near the text and/or near the
image, as
seen in Figures 4a and 4b. In addition, as shown in these examples, preferably
clicking on the sub-headline, for example "Arafat dares Israel to kill him
after cabinet vote",
will lead to the relevant specific news item, and the sub-headlines themselves
preferably each have a separate link to related items next to it, so that for
example
each such sub-cluster has a smaller number of links related to it. For example
in the
example about Arafat's suggested deportation on Fig. 4b there are 5 related
links to
the sub-headline "Israeli defence minister says 'kill Arafat'", 6 related
links to the
sub-headline about the response of Arafat's supporters, 5 related links to "US
opposes
Arafat expulsion", and at the bottom there is the link to the list of 877
related items,
which means the entire set of items that belong to the wider cluster (however,
as explained above, even clicking on this link will preferably show the list of
877 items
clustered again into sub-clusters and sub-sub-clusters, etc.). Another
possible
variation is to add for example a similar link also next to the main item, so
that it will
say for example in this case "and 2~ related »" for example next to the first
sentence of the
main item, which is preferably the biggest sub-cluster, as shown in Fig. 4a.
Of course,
this is just an example and other similar configurations could also be used to
display
such clusters and sub-clusters, preferably together with their related links.
Preferably
the system determines which item to use as the main item of the general
cluster (for
example this general cluster of 877 items) by first picking the sub-cluster
that has the
largest number of items (and/or for example the most recent sub-cluster that
is big
enough relative to other sub-clusters) and then picking for example the item
within
this largest sub-cluster (or otherwise chosen first sub-cluster) which has for
example
the highest average similarity to other items in that sub-cluster and/or for
example
belongs to the largest sub-cluster of that sub-cluster and/or for example is
most
relevant within the cluster or within the sub-cluster and/or for example is
most recent
within the cluster or within the sub-cluster, etc. So if for example the
entire large
cluster of clusters that relates to Arafat's suggested deportation has 877
items, and for
example there are 27 items in the cluster about Israel deciding to deport
Arafat, and
other sub-clusters have less items, then this naturally becomes the main sub-
cluster
from which the main item or items are chosen, and for example the next two
largest
sub-clusters become the next two sub-headlines, etc. Another possible
variation is for
example to put first the more recent sub-cluster for example if it is large
enough or for
example if the difference in size between it and a larger less recent sub-
cluster is
small enough.
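The selection of the main item described above (first the largest sub-cluster, then the item with the highest average similarity to the other items in it) can be sketched as follows. The similarity function is passed in as a parameter, and the word-overlap measure used in the usage note is only an assumed stand-in; the alternative criteria mentioned above, such as recency or relevance, are not modeled.

```python
def pick_main_item(sub_clusters, similarity):
    """Return (main_item, sub_clusters ordered from largest to smallest).

    sub_clusters is a list of lists of items; `similarity` is any callable
    taking two items and returning a number.
    """
    # Largest sub-cluster first; the next ones supply the sub-headlines.
    ordered = sorted(sub_clusters, key=len, reverse=True)
    main = ordered[0]

    def average_similarity(index):
        others = [item for j, item in enumerate(main) if j != index]
        if not others:
            return 0.0
        return sum(similarity(main[index], o) for o in others) / len(others)

    best_index = max(range(len(main)), key=average_similarity)
    return main[best_index], ordered


def word_overlap(a, b):
    """Assumed stand-in similarity: shared share of the words in two headlines."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0
```

Calling `pick_main_item(sub_clusters, word_overlap)` on a list of sub-clusters of headlines returns the most "central" headline of the largest sub-cluster as the main item, and the ordered list from which the sub-headlines can be taken.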
Referring to Fig. 5, I show an example of a preferable way in which multiple
streaming video images of the same event from various Online news sources can
appear on the screen side by side. If streaming video is used for example in a
few or
more of the news sources that deal with the same event, then preferably the
user can
also request for example an automatic formation of a group of these sources on
the
same screen, like a split screen in cable TV for example, except that the
group is
preferably automatically and dynamically generated according to the item of
interest
and according to availability in the various sources. So preferably the user
can see for
example 4 or 9 (or any other reasonable number of) small streaming media
images on
the same screen and preferably for example switch the sound each time to one
of them
(or for example the sound is not enabled in order to force the user to go to
the actual
site if he wants also the sound), and then by clicking for example on one of
them the
user is preferably transferred to that source to view it normally there.
Preferably the
user can switch to the multi-view of the streaming images next to each other
for
example by clicking on something near the original preferably automatically
switching image, for example the icon of a split screen or the words "Split
Screen",
shown next to the images in the example of Fig. 4a, so that preferably the
split screen
is created automatically by expanding the switching available still images
and/or
streaming images to appear together side by side. Preferably the split screen
can
contain for example also some normal images instead of just streaming data. If
there
are for example 20 available images for a certain cluster or sub-cluster, out
of which
for example 5 images contain streaming data, then preferably the system
organizes
first of all the streaming data images next to each other, and adds afterwards
the still
images. Since 20 images in this example might not fit on one screen, then
either the
user can use for example the browser's scroll bar on the side to view the
rest of the
images, or for example only 9 or 12 images are shown and the others for
example
continue to switch automatically or the user can for example press some button
to
switch between the multiple split screens that were created. Preferably the
streaming
data or any other data is supplied to the users more efficiently by the same
mechanisms explained in the reference to Fig. 6. Preferably if one of the
sources for
example stops broadcasting the relevant streaming data, it can automatically
be
removed from the split screen or for example is replaced with a relevant still
image,
and if for example a new relevant data stream becomes available from another
source,
it can preferably be automatically added by the system to the split screen.
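The arrangement of the split screen described above (streaming images first, still images afterwards, and more than one screen when the images do not fit) can be sketched as follows. The dictionary layout of an image and the default of 9 images per screen are assumptions for illustration.

```python
def build_split_screens(images, per_screen=9):
    """Order images with streaming sources first and divide them into screens.

    Each image is assumed to be a dict with a boolean 'streaming' key.
    The sort is stable, so the original order is kept within each kind.
    """
    ordered = sorted(images, key=lambda image: not image["streaming"])
    return [ordered[i:i + per_screen]
            for i in range(0, len(ordered), per_screen)]
```

With the 20 images of the example above, of which 5 are streaming, this yields three screens of 9, 9 and 2 images, and the first screen starts with the 5 streaming images. A source that stops broadcasting could be handled by setting its 'streaming' value to False and rebuilding the screens.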
Referring to Fig. 6, I show an example of a condensed packet for much more
efficient distribution of the same data to multiple users. As explained in the
patent
summary, preferably as additional new related news items come in, the
headlines are
automatically updated even if the user does not request any refresh. For
example if
there is a report on a new suicide bombing in Israel, as additional details come in and
come in and
the same items in the various sources become more updated or new items are
added,
preferably this is also automatically updated in the automatic news page that
the user
has in front of him (for example if the headline or the first sentence have
changed or
the images have changed). This is preferably done by automatic partial refresh
on a
need basis, as explained already in Canadian application no. 2,432,817 of Jul.
4, 2003
(and in subsequent continuations of that application in the US and Canada) by
the
present inventor, as explained below, and preferably by grouping identical
data
packets in groups so that each group contains a single copy of the identical
data
packet together with a multiple list of targets, so that each group preferably
goes to a
certain general area or direction, and when it reaches that general area the
data is
preferably duplicated and split up into the individual packets, or into
smaller groups
with less targets, which are later split up into the individual packets, as
explained for
example in PCT application PCT/IL 01/01042 of Nov. 8, 2001 and US application
10/375,208 by the present inventor. This is preferably done in combination
with using
a preferably hierarchical system of routers and Physical (geographical) IP
addresses
(preferably for example GPS based), as explained also in these applications.
Similarly
preferably all the data and especially for example any streaming video images
are
preferably distributed this way to the large number of the automatic news
viewers. As
explained in these applications, this efficient distribution can be used for
example
both when sending data to users and when sending data to various proxies or
mirror
sites such as for example Akamai servers. However, since, as explained above,
headlines and images preferably keep changing anyway between items of the
relevant
cluster or sub-clusters, preferably the user gets a different indication when
the items
themselves have changed or new items or images are added, such as for example
some sound indication, preferably accompanied by a visual indication of the new
item, such as for example a red frame around it, and/or for example the words
"Fresh update" near it, etc. The sound indication has a further advantage,
since the
user can be alerted for example even if he is currently working on another
window.
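The distinction drawn above, between a headline that merely rotates among
already known items of a cluster and a cluster whose items have actually
changed, can be sketched as follows. This is only an illustrative sketch: the
names Item and fresh_items, the choice of fields, and the use of a content hash
are assumptions made for the example and are not taken from the application.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One news item of a cluster (hypothetical structure)."""
    source: str
    headline: str
    first_sentence: str
    image_url: str = ""

    def fingerprint(self) -> str:
        # Hash the fields whose change should count as a real update.
        text = "\n".join([self.headline, self.first_sentence, self.image_url])
        return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fresh_items(previous: list[Item], current: list[Item]) -> list[Item]:
    """Return the items that are new or whose content has changed.

    Showing a different, already known item of the same cluster (rotation)
    yields an empty list, so no "Fresh update" indication is triggered.
    """
    known = {(item.source, item.fingerprint()) for item in previous}
    return [
        item for item in current
        if (item.source, item.fingerprint()) not in known
    ]
```

A non-empty result would be the trigger for the sound indication and for the
red frame or "Fresh update" label around the returned items.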
The automatic partial refresh is preferably done as follows: In order to save
bandwidth for example the html protocol is preferably changed so that it is
possible to
define for example "refresh on a need basis", which means that the refresh
command
is initiated automatically by the site when there is any change in the page
(so that the
browser can get a refresh even if it didn't ask for it), or for example the
browser asks
for refresh more often (for example every 20 seconds or even less), but if
nothing has
changed then the browser gets just for example a code that tells it to keep
the current
page or window as is. The first of these two variations is preferable, since it
also avoids wasting bandwidth on unnecessary refresh requests from the
browsers. In
addition, when the refresh is sent, preferably it can be a smart refresh,
which tells the
browser preferably only what to change on the page instead of having to send
the
entire page again. Another possible variation is to implement this "refresh on
need"
for example by ActiveX and/or Java and/or JavaScript and/or some plug-in or
other
dynamic code that is updated only when there is a need for it. Another
possible
variation is for example to keep the page open like a streaming audio or video
so that
the browser always waits for new input but preferably knows how to use the new
input for updating the page without having to get the whole page again and
preferably
doesn't have to do anything until the new input arrives. Of course, like other
features
in this invention, the above features or variations can be used also
independently of
any other features of this invention, for example also independently of any
Metasearch or automatic "newspaper" application.
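As a minimal sketch of the "refresh on a need basis" idea, the following models
the second variation described above (the browser asks often and gets only a
short "keep" code when nothing has changed) together with the smart refresh
(only the changed parts of the page are sent). The class names NewsPage and
Browser, the division of the page into named regions, and the version counter
are all assumptions made for illustration. The application does not specify a
wire format, and removal of regions is not handled here.

```python
class NewsPage:
    """Server-side page state, split into named regions (hypothetical)."""

    def __init__(self) -> None:
        self.version = 0
        self.regions: dict[str, str] = {}
        self._changed_at: dict[str, int] = {}

    def update(self, region: str, content: str) -> None:
        # Only a real change bumps the version.
        if self.regions.get(region) == content:
            return
        self.version += 1
        self.regions[region] = content
        self._changed_at[region] = self.version

    def poll(self, client_version: int) -> dict:
        """Answer a refresh request from a browser holding client_version."""
        if client_version == self.version:
            # Nothing changed: a short code instead of the whole page.
            return {"status": "keep"}
        # Smart refresh: send only regions changed since the client's version.
        changes = {
            region: self.regions[region]
            for region, changed in self._changed_at.items()
            if changed > client_version
        }
        return {"status": "patch", "version": self.version, "changes": changes}


class Browser:
    """Client-side copy of the page that applies partial refreshes."""

    def __init__(self) -> None:
        self.version = 0
        self.regions: dict[str, str] = {}

    def refresh(self, page: NewsPage) -> str:
        reply = page.poll(self.version)
        if reply["status"] == "patch":
            self.regions.update(reply["changes"])
            self.version = reply["version"]
        return reply["status"]
```

In the first, preferred variation the same patch would be sent by the site on
its own initiative whenever update() changes something, so that the "keep"
round trips disappear altogether.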
The structure of automatically condensed identical packets is illustrated in
Fig. 6.
Preferably the condensed packet (61) contains just a single copy of the
identical data
(62) and an extended header (63), which contains a normal header (65)
(preferably
with a mark that indicates that this is actually a condensed packet), and a
list (64) of
the preferably physical (geographic) IP target addresses of the original
packets that
contained the same identical data in their body and were condensed in this
group. So,
for example, when sending the same streaming data (or any other same data) for
example to millions of users at the same time, preferably one or more such
condensed
packets are created, preferably by the sending web server, and each condensed
packet
goes to a certain general target area, and as it reaches the general target
area the
condensed packet is preferably replicated and regrouped into smaller groups,
each containing fewer target addresses, and eventually replicated back to single
packets with
a single target address each, as the packet nears its final destination. As
explained in
the above mentioned applications, this can lead to huge savings both in terms
of
bandwidth and in terms of the number of routing decisions that have to be made
on
the way.
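The condensing and later splitting of identical packets described above can be
sketched as follows. This is an illustrative sketch only: the physical
(geographic) IP addresses are modelled here as hierarchical tuples of integers
such as (area, sub-area, host), which is an assumption, and the names Packet,
CondensedPacket, condense, split and expand are hypothetical. The reference
numerals in the comments refer to Fig. 6.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    target: tuple[int, ...]  # assumed hierarchical address (area, sub-area, host)
    data: bytes


@dataclass(frozen=True)
class CondensedPacket:
    targets: tuple[tuple[int, ...], ...]  # list (64) of target addresses
    data: bytes                           # single copy (62) of the identical data
    condensed: bool = True                # mark in the normal header (65)


def condense(packets: list[Packet]) -> list[CondensedPacket]:
    """Sender side: group packets carrying identical data into one packet each."""
    groups: dict[bytes, list[tuple[int, ...]]] = defaultdict(list)
    for packet in packets:
        groups[packet.data].append(packet.target)
    return [CondensedPacket(tuple(t), d) for d, t in groups.items()]


def split(packet: CondensedPacket, level: int) -> list[CondensedPacket]:
    """Router side: regroup into smaller groups by the address part at `level`."""
    by_area: dict[int, list[tuple[int, ...]]] = defaultdict(list)
    for target in packet.targets:
        by_area[target[level]].append(target)
    return [CondensedPacket(tuple(t), packet.data) for t in by_area.values()]


def expand(packet: CondensedPacket) -> list[Packet]:
    """Near the destination: replicate back to single-target packets."""
    return [Packet(target, packet.data) for target in packet.targets]
```

In this sketch a router at depth n of the hierarchy would call split(packet, n)
and forward each resulting group towards its area, so the data is copied only
where the routes diverge.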
While the invention has been described with respect to a limited number of
embodiments, it will be appreciated that many variations, modifications,
expansions and other applications of the invention may be made which are
included within the scope of the present invention, as would be obvious to
those
skilled in the art.