COMMODITY SHORT TITLE GENERATION METHOD AND APPARATUS
BACKGROUND OF THE INVENTION
Technical Field
[0001] The present invention relates to the technical field of text
abstracting, and more
particularly to a method and an apparatus for generating merchandise short-
titles.
Description of Related Art
[0002] Merchandise short-titles are generally formed by compressing the standard-length titles of
merchandise items. As implied in the name, short-titles are simple, concise,
and short.
The purpose of short-titles is to describe key information of merchandise
items with the
least possible words so that users can get such key information at a glance.
An example
of a short-title is "Korean-cutting all-over print dress." This can be
regarded as a special
text abstracting technology in the sense of natural language processing.
[0003] Traditional text abstracting techniques, such as TextRank and Lead-3, abstract
sentences from articles and are not really suitable for generating merchandise titles.
With the rapid development of deep learning, various deep-learning models, such as
seq2seq and pointer-generator networks, can be used to generate compressed short-titles.
However, without a sufficient short-title training corpus, these models are not
applicable in practice, particularly for generation of merchandise titles.
SUMMARY OF THE INVENTION
[0004] The objective of the present invention is to provide a method and an apparatus for
generating merchandise short-titles, which can generate merchandise short-titles with
improved efficiency and precision.
[0005] For achieving the foregoing objective, in a first aspect, the present
invention provides a
method for generating a merchandise short-title, which comprises:
[0006] crawling merchandise title data and/or collecting search term data, so
as to construct a
Date Recue/Date Received 2023-10-25
corpus data set;
[0007] based on a merchandise category table, categorizing corpuses in the
corpus data set by
merchandise categories, and then extracting key words to construct a word
library;
[0008] tagging each key word in the word library as either a modifier word or
a category word
according to a part of speech of the word;
[0009] performing word segmentation on the original merchandise title data so
as to obtain plural
title words, matching each of the title words with the key words in the word
library,
respectively, and outputting the key words that have matches; and
[0010] sieving out at least two effective key words from the plural key words,
and stitching the
effective key words into the merchandise short-title according to their parts
of speech.
[0011] Preferably, the step of categorizing corpuses in the corpus data set by merchandise
categories based on a merchandise category table, and then extracting key words to
construct a word library comprises:
[0012] based on the merchandise category table, categorizing the corpuses in
the corpus data set
one by one according to the merchandise categories;
[0013] performing word segmentation on the corpuses, respectively, so as to
obtain the plural
key words, and de-duplicating and then filtering the key words in every
merchandise
category so as to obtain key word sets each corresponding to a said
merchandise category;
and
[0014] uniting the plural key word sets to form the word library.
[0015] More preferably, the step of tagging each key word in the word library
as either a modifier
word or a category word according to a part of speech of the word comprises:
[0016] extracting the key words that are the modifier words or the category
words from the word
library by means of manual tagging and tagging the corresponding parts of
speech; and/or
[0017] extracting the key words that are the modifier words or the category words from the word
library and tagging the corresponding parts of speech using a machine tagging model.
[0018] Further, after the step of extracting the key words that are the
modifier words or the
category words from the word library by means of manual tagging and tagging
the
corresponding parts of speech, the method further comprises:
[0019] crawling new merchandise title data, performing word segmentation
thereon, and
matching resulting words with the key words in the word library;
[0020] if a number of the key words that have matches is smaller than a
threshold, adding the
key words in the new merchandise title data into the corresponding key word
sets, and
tagging the newly added key words for their parts of speech; or
[0021] if the number of the key words that have matches is greater than the
threshold, crawling
new merchandise title data, performing word segmentation thereon, and matching
resulting words with the key words in the word library again.
[0022] Preferably, after the step of extracting the key words that are the modifier words or the
category words from the word library and tagging the corresponding parts of speech using a
machine tagging model, the method further comprises:
[0023] based on a semantic recognition technology in the machine model,
extracting the key
words that are the modifier words or the category words from the newly crawled
merchandise title data, adding them into the corresponding key word sets, and
tagging
the newly added key words for their corresponding parts of speech.
[0024] Preferably, the step of performing word segmentation on the original
merchandise title
data so as to obtain plural title words, matching each of the title words with
the key words
in the word library, respectively, and outputting the key words that have
matches
comprises:
[0025] recognizing the merchandise categories in the original merchandise
title data, and
matching them with the corresponding key word sets; and
[0026] segmenting the original merchandise title data into the plural title
words, matching each
of the title words with the key words in the corresponding key word set, and
sieving out
the key words that have matches.
[0027] Preferably, the step of sieving out at least two effective key words
from the plural key
words, and stitching the effective key words into the merchandise short-title
according to
their parts of speech comprises:
[0028] recording location information of each of the key words in the original
merchandise title
data;
[0029] if in the key words tagged as the modifier words, there are plural said
key words whose
lexical scopes have intersection, only one said key word in the intersection
is kept;
[0030] if in the key words tagged as the modifier words, there are plural said key words in which
the lexical scope of one said key word contains the lexical scope of another said key word,
only the key word that has the largest lexical scope is kept;
[0031] if the key words tagged as the category words have word sense
containing word sense of
any said key word tagged as the modifier word, the key word corresponding to
the
modifier word is removed; and
[0032] defining the remaining key words as the effective key words, and stitching them into the
merchandise short-title according to locational sequence thereof.
[0033] Optionally, different original merchandise title data are matched with the word library,
respectively, and processed in parallel, so as to output plural corresponding
merchandise short-titles.
[0034] Exemplarily, the search term data represent a collection of search
terms to be input by a
user for searching for a merchandise item.
[0035] As compared to the prior art, the method for generating merchandise
short-titles of the
present invention provides the following beneficial effects:
[0036] In the method for generating merchandise short-titles according to the present invention,
a corpus data set is first constructed. Then, based on the merchandise category table,
corpuses in the corpus data set are categorized. From the categorized corpuses, key words
are extracted to form a word library. Every key word in the word library is tagged as a
modifier word or a category word according to its part of speech. The word library is thus
established. Afterward, original merchandise title data to be compressed are acquired.
The original merchandise title data are segmented to obtain plural title words. These title
words are matched with the key words in the word library. From the key words that have
matches, at least two effective key words are sieved out and stitched into
a merchandise short-title according to the order of their parts of speech.
[0037] It is thus clear that the present invention categorizes corpuses before tagging them,
thereby effectively reducing the difficulty of the tagging process and tagging key words more
efficiently. Because the original merchandise title data are segmented and directly matched
with the key words in the word library, the sieved and stitched merchandise short-title
is more precise.
[0038] In another aspect, the present invention provides an apparatus for generating merchandise
short-titles, which applies the method for generating merchandise short-titles as
described above. The apparatus comprises:
[0039] a data collecting unit, for crawling merchandise title data and/or
collecting search term
data, so as to construct a corpus data set;
[0040] a word library unit, for categorizing corpuses in the corpus data set by merchandise
categories based on a merchandise category table, and then extracting key words to construct
a word library;
[0041] a word tagging unit, for tagging each key word in the word library as
either a modifier
word or a category word according to a part of speech of the word;
[0042] a word matching unit, for performing word segmentation on the original
merchandise title
data so as to obtain plural title words, matching each of the title words with
the key words
in the word library, respectively, and outputting the key words that have
matches; and
[0043] a processing unit, for sieving out at least two effective key words
from the plural key
words, and stitching the effective key words into the merchandise short-title
according to
their parts of speech.
[0044] As compared to the prior art, the disclosed apparatus for generating
merchandise short-
titles provides beneficial effects that are similar to those provided by the
method for
generating merchandise short-titles as enumerated above, and thus no
repetitions are
made herein.
[0045] In a third aspect, the present invention provides a computer-readable
storage medium, in
which a computer program is stored. When run by a processor, the computer
program
executes the steps of the method for generating merchandise short-titles as
described
above.
[0046] As compared to the prior art, the disclosed computer-readable storage
medium provides
beneficial effects that are similar to those provided by the method for
generating
merchandise short-titles as enumerated above, and thus no repetitions are made
herein.
BRIEF DESCRIPTION OF THE DRAWINGS
[0047] The accompanying drawing is provided herein for better understanding of the present
invention and forms a part of this disclosure. The illustrative embodiments and their
descriptions are for explaining the present invention and by no means form any improper
limitation to the present invention, wherein:
[0048] FIG. 1 is a flowchart of a method for generating merchandise short-
titles according to a
first embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0049] To make the foregoing objectives, features, and advantages of the present invention
clearer and more understandable, the following description is directed to some
embodiments as depicted in the accompanying drawing to detail the technical schemes
disclosed in these embodiments. It is, however, to be understood that the embodiments
referred to herein are only a part of all possible embodiments and thus not exhaustive.
All other embodiments that people of ordinary skill in the art can conceive without
creative labor based on the embodiments of the present invention shall be
encompassed in the scope of the present invention.
[0050] Embodiment 1
[0051] Referring to FIG. 1, the present embodiment provides a method for
generating a
merchandise short-title, comprising:
[0052] crawling merchandise title data and/or collecting search term data, so
as to construct a
corpus data set; based on a merchandise category table, categorizing corpuses
in the
corpus data set by merchandise categories, and then extracting key words to
construct a
word library; tagging each key word in the word library as either a modifier
word or a
category word according to a part of speech of the word; performing word
segmentation
on the original merchandise title data so as to obtain plural title words,
matching each of
the title words with the key words in the word library, respectively, and
outputting the
key words that have matches; and sieving out at least two effective key words
from the
plural key words, and stitching the effective key words into the merchandise
short-title
according to their parts of speech.
[0053] In the method for generating merchandise short-titles according to the present
embodiment, a corpus data set is first constructed. Then, based on the merchandise
category table, corpuses in the corpus data set are categorized. From the categorized
corpuses, key words are extracted to form a word library. Every key word in the word
library is tagged as a modifier word or a category word according to its part of speech.
The word library is thus established. Afterward, original merchandise title data to be
compressed are acquired. The original merchandise title data are segmented to obtain plural
title words. These title words are matched with the key words in the word library. From the
key words that have matches, at least two effective key words are sieved out and stitched
into a merchandise short-title according to the order of their parts of speech.
[0054] It is thus clear that the present invention categorizes corpuses before tagging them,
thereby effectively reducing the difficulty of the tagging process and tagging key words more
efficiently. Because the original merchandise title data are segmented and directly matched
with the key words in the word library, the sieved and stitched merchandise short-title
is more precise.
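The overall flow of this embodiment can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the whitespace tokenizer stands in for real word segmentation, and the names `WORD_LIBRARY`, `segment`, and `generate_short_title`, as well as the library contents, are hypothetical. Placing modifier words before category words is only one possible reading of stitching "according to their parts of speech".

```python
# Hypothetical word library: key word -> tag ("modifier" or "category").
WORD_LIBRARY = {
    "korean-cutting": "modifier",
    "all-over": "modifier",
    "print": "modifier",
    "dress": "category",
}


def segment(title):
    """Stand-in for word segmentation (assumption: whitespace-delimited title)."""
    return title.split()


def generate_short_title(title, library):
    """Match title words against the library and stitch the matches together."""
    # Keep only the title words that have a match in the word library.
    matched = [word for word in segment(title) if word.lower() in library]
    # The method requires at least two effective key words.
    if len(matched) < 2:
        return None
    # Stable sort: modifier words first, category words last (assumed ordering).
    matched.sort(key=lambda word: library[word.lower()] == "category")
    return " ".join(matched)


title = "2023 new Korean-cutting all-over print dress free shipping"
print(generate_short_title(title, WORD_LIBRARY))
```

With this toy library, the words "2023", "new", "free", and "shipping" have no match and are dropped, while the matched words keep their relative order.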
[0055] It is to be noted that the data of the corpus data sets are obtained by
crawling the
merchandise title data and collecting the search term data. For crawling the
merchandise
title data, it is important to crawl merchandise short-titles from major e-
commerce
platforms. For collecting the search term data, search terms used for
searching for various
merchandise items, namely query data, are gathered.
[0056] In the embodiment, the step of categorizing corpuses in the corpus data set by merchandise
categories based on a merchandise category table, and then extracting key words
to construct a word library comprises:
[0057] based on the merchandise category table, categorizing the corpuses in
the corpus data set
one by one according to the merchandise categories; performing word
segmentation on
the corpuses, respectively, so as to obtain the plural key words, and de-
duplicating and
then filtering the key words in every merchandise category so as to obtain key
word sets
each corresponding to a said merchandise category; and uniting the plural key word sets
to form the word library.
[0058] Since tagging corpuses directly represents a prodigious workload, it is desirable, for
reducing difficulty and improving efficiency of the tagging task, to categorize corpuses in the
corpus data set according to a merchandise category table (e.g., a quaternary merchandise
group). For example, the categories may include a clothes corpus group, a pants corpus
group, a mobile phone corpus group, etc. Then the categorized corpuses are segmented
so that every category group is formed by plural key words. Irrelevant key words
are filtered out (denoising), and the key words in every category group are
de-duplicated, so as to ensure that every key word is unique in its group. Eventually, key word
sets are formed, each corresponding to a category group. By uniting all the key word sets,
the word library is formed.
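The construction of the word library described above can be sketched as follows. The sketch assumes a whitespace tokenizer, a category table that maps each category to a few cue words, and a fixed noise-word list; `CATEGORY_TABLE`, `NOISE_WORDS`, `categorize`, and `build_word_library` are hypothetical names introduced for illustration, since the disclosure does not specify how categorization or denoising is carried out.

```python
# Hypothetical category table: category name -> cue words that identify it.
CATEGORY_TABLE = {
    "dress": {"dress"},
    "phone": {"phone"},
}
# Hypothetical noise words to be filtered out of every category group.
NOISE_WORDS = {"new", "2023", "free", "shipping"}


def categorize(text, table):
    """Return the first category whose cue words appear in the text, else None."""
    tokens = set(text.lower().split())
    for category, cues in table.items():
        if tokens & cues:
            return category
    return None


def build_word_library(corpus, table, noise):
    """Build one key word set per category; together the sets form the library."""
    key_word_sets = {category: set() for category in table}
    for text in corpus:
        category = categorize(text, table)
        if category is None:
            continue  # corpus entry does not belong to any known category
        for token in text.lower().split():
            if token not in noise:
                # A set keeps every key word unique within its category group.
                key_word_sets[category].add(token)
    return key_word_sets


corpus = ["new print dress", "print dress free shipping", "5G phone"]
library = build_word_library(corpus, CATEGORY_TABLE, NOISE_WORDS)
```

Keeping the per-category sets inside one mapping, rather than flattening them, lets a later matching step look up only the key word set of the recognized category.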
[0059] In the embodiment, the step of tagging each key word in the word
library as either a
modifier word or a category word according to a part of speech of the word
comprises:
[0060] extracting the key words that are the modifier words or the category words from the word
library by means of manual tagging and tagging the corresponding parts of speech; and/or
extracting the key words that are the modifier words or the category words from the word
library and tagging the corresponding parts of speech using a machine tagging model.
[0061] As implied in the name, manual tagging refers to manually determining
whether a key
word in the word library is a modifier word or a category word, and manually
tagging the
key word. In contrast, a machine tagging model recognizes and tags key words
automatically. When the number of key words in the word library is
huge, such
a machine model is effective in improving tagging efficiency. However, as
demonstrated
in practice, while a machine model provides high efficiency, its tagging
results are less
precise than those from manual operation. Therefore, it is preferred to
combine the two
solutions for tagging key words in the word library. For example, a machine
model is first
used to pre-tag numerous key words, and then manual verification is performed,
so as to
balance and maximize efficiency and precision of key-word tagging.
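The combination of machine pre-tagging and manual verification can be sketched as follows. The rule-based `pre_tag` function is only a stand-in for a machine tagging model, and `tag_library`, `category_nouns`, and `manual_review` are hypothetical names; the disclosure does not prescribe how manual corrections are recorded.

```python
def pre_tag(word, category_nouns):
    """Stand-in for a machine tagging model (assumption: simple lookup rule)."""
    return "category" if word in category_nouns else "modifier"


def tag_library(words, category_nouns, manual_review):
    """Pre-tag every key word by machine, then apply manual corrections."""
    tags = {word: pre_tag(word, category_nouns) for word in words}
    for word, corrected_tag in manual_review.items():
        if word in tags:
            # The manual verification result overrides the machine pre-tag.
            tags[word] = corrected_tag
    return tags


tags = tag_library(
    words=["print", "dress", "skirt"],
    category_nouns={"dress"},
    manual_review={"skirt": "category"},  # reviewer fixes a machine error
)
```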
[0062] After the step of extracting the key words that are the modifier words
or the category
words from the word library by means of manual tagging and tagging the
corresponding
parts of speech, the method further comprises:
[0063] crawling new merchandise title data, performing word segmentation
thereon, and
matching resulting words with the key words in the word library; if a number
of the key
words that have matches is smaller than a threshold, adding the key words in
the new
merchandise title data into the corresponding key word sets, and tagging the
newly added
key words for their parts of speech; or if the number of the key words that
have matches
is greater than the threshold, crawling new merchandise title data, performing
word
segmentation thereon, and matching resulting words with the key words in the
word
library again.
[0064] The objective of this step is to enrich the word sources of the word library. By
continually acquiring new merchandise title data, the robustness of the key words in the word
library can be evaluated. Specifically, word segmentation is performed on the
new merchandise title data, and the results are filtered so that only those words whose
parts of speech identify them as modifier words or category words are kept. When the
number of remaining words that match key words in the word library is
smaller than a threshold, the key words in the word library are not robust
enough. In this case, the words in the merchandise title data that do not have matches
are added to the corresponding key word sets, and the newly added key words are
tagged with their parts of speech. Conversely, if the number of remaining words that
match key words in the word library is greater than the threshold, the collection of
key words in the word library is adequate for the current merchandise title data.
A user can then continue to crawl new merchandise title
data and repeat the foregoing process to continuously assess the word library.
Exemplarily, the threshold is 3.
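The threshold check described above can be sketched as follows. The sketch assumes that the title words have already been segmented and filtered to modifier and category candidates, that a key word set is stored as a mapping from key word to tag, and that tagging is delegated to a caller-supplied function; `update_key_word_set` and `tag_word` are hypothetical names. The text does not say what happens when the match count equals the threshold, so treating that case as sufficient is also an assumption.

```python
def update_key_word_set(title_words, key_word_set, tag_word, threshold=3):
    """Add unmatched title words when too few of them match the key word set.

    Returns True if the key word set was extended, False if it was sufficient.
    """
    matched = [word for word in title_words if word in key_word_set]
    if len(matched) < threshold:
        for word in title_words:
            if word not in key_word_set:
                # Newly added key words are tagged with their part of speech.
                key_word_set[word] = tag_word(word)
        return True
    # Enough matches: the set copes with this title; keep crawling new titles.
    return False


key_words = {"print": "modifier", "dress": "category"}
extended = update_key_word_set(
    ["floral", "print", "dress", "chiffon"],
    key_words,
    tag_word=lambda word: "modifier",  # stand-in for manual or machine tagging
)
```

In this example only two title words match, which is below the exemplary threshold of 3, so "floral" and "chiffon" are added to the key word set.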
[0065] After the step of extracting the key words that are the modifier words or the category
words from the word library and tagging the corresponding parts of speech using a
machine tagging model, the method further comprises:
[0066] based on a semantic recognition technology in the machine model,
extracting the key
words that are the modifier words or the category words from the newly crawled
merchandise title data, adding them into the corresponding key word sets, and
tagging
the newly added key words for their corresponding parts of speech.
[0067] Optionally, the machine model may be a BiLSTM+CRF deep learning model. Such a
deep learning model is used to extract the key words that are modifier words or category
words from the newly crawled merchandise title data, tag them, and add
them into the corresponding key word sets. The deep learning model demonstrates great
adaptability and can automatically recognize category words and modifier words in the
merchandise title according to contextual information.
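A sequence-labeling model such as BiLSTM+CRF typically emits one label per token, which then has to be decoded into tagged key words. The sketch below shows only that decoding step and does not implement the model itself. The BIO label scheme (`B-MOD`, `I-MOD`, `B-CAT`, `I-CAT`, `O`) and the function name `decode_bio` are assumptions made for illustration; the disclosure does not specify the label set.

```python
def decode_bio(tokens, labels):
    """Group per-token BIO labels into (key word, type) pairs."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Emit the span collected so far, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)  # continue the open span
        else:
            # "O" or an inconsistent "I-" label closes the open span.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


tokens = ["new", "all-over", "print", "dress"]
labels = ["O", "B-MOD", "I-MOD", "B-CAT"]  # hypothetical model output
print(decode_bio(tokens, labels))
```

Each decoded pair could then be added to the corresponding key word set, with `MOD` mapped to a modifier word and `CAT` to a category word.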
[0068] Further, in the embodiment, the step of performing word segmentation on
the original
merchandise title data so as to obtain plural title words, matching each of
the title words
with the key words in the word library, respectively, and outputting the key
words that
have matches comprises:
[0069] recognizing the merchandise categories in the original merchandise
title data, and
matching them with the corresponding key word sets; and segmenting the
original
merchandise title data into the plural title words, matching each of the title
words with
the key words in the corresponding key word set, and sieving out the key words
that have
matches.
[0070] Preferably, multiple different original merchandise title data may be
acquired at the same
time and matched with the word library, respectively. Then parallel processing
is
performed to output plural merchandise short-titles.
[0071] In practical implementations, merchandise categories in different
original merchandise
title data can be recognized at the same time and have respective matched key
word sets.
The original merchandise title data are segmented into plural title words.
Then each of
the title words is matched with the key words in the corresponding key word
set, and the
key words that have matches in the original merchandise title data are sieved out.
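Category recognition, per-category matching, and parallel processing of several titles can be sketched as follows. The sketch assumes a whitespace tokenizer and recognizes the category by looking for a category word from one of the key word sets, which is only one plausible strategy; `KEY_WORD_SETS`, `recognize_category`, `match_title`, and `match_titles` are hypothetical names. A thread pool from the Python standard library is used as a simple stand-in for whatever parallel processing an implementation would adopt.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical key word sets: category -> {key word: tag}.
KEY_WORD_SETS = {
    "dress": {"print": "modifier", "dress": "category"},
    "phone": {"5g": "modifier", "phone": "category"},
}


def recognize_category(title_words, key_word_sets):
    """Pick the first category whose category word occurs in the title."""
    for category, key_words in key_word_sets.items():
        if any(key_words.get(word) == "category" for word in title_words):
            return category
    return None


def match_title(title):
    """Match one title against the key word set of its recognized category."""
    words = title.lower().split()
    category = recognize_category(words, KEY_WORD_SETS)
    if category is None:
        return []
    key_words = KEY_WORD_SETS[category]
    return [word for word in words if word in key_words]


def match_titles(titles):
    """Process several titles in parallel; results keep the input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(match_title, titles))


print(match_titles(["new print dress", "5G phone case"]))
```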
[0072] Further, in the embodiment, the step of sieving out at least two
effective key words from
the plural key words, and stitching the effective key words into the
merchandise short-
title according to their parts of speech comprises:
[0073] recording location information of each of the key words in the original
merchandise title
data; if in the key words tagged as the modifier words, there are plural said
key words
whose lexical scopes have intersection, only one said key word in the
intersection is kept;
if in the key words tagged as the modifier words, there are plural said key
words in which
the lexical scope of one said key word contains the lexical scope of another
said key word,
only the key word that has the largest lexical scope is kept; if the key words tagged as the
tagged as the
category words have word sense containing word sense of any said key word
tagged as
the modifier word, the key word corresponding to the modifier word is removed;
and
defining the remaining key words as the effective key words, and stitching them into the
into the
merchandise short-title according to locational sequence thereof. In practical
implementations, the key words tagged as the category words in the original
merchandise
title data are processed first.
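The sieving rules above can be sketched as follows. The sketch makes several assumptions that the text leaves open: a lexical scope is modeled as a character span in the original title, "word sense containing" is approximated by a modifier span lying inside a category span, and when modifier spans intersect the longest one is kept. `Match`, `locate`, and `sieve_and_stitch` are hypothetical names.

```python
from typing import NamedTuple


class Match(NamedTuple):
    text: str
    start: int  # inclusive character offset in the original title
    end: int    # exclusive character offset
    tag: str    # "modifier" or "category"


def locate(title, text, tag):
    """Record the location of a key word (first occurrence) in the title."""
    start = title.index(text)
    return Match(text, start, start + len(text), tag)


def overlaps(a, b):
    return a.start < b.end and b.start < a.end


def sieve_and_stitch(matches):
    # Category words are processed first and are always kept.
    categories = [m for m in matches if m.tag == "category"]
    # Drop modifiers whose span lies inside a category word's span.
    modifiers = [
        m for m in matches
        if m.tag == "modifier"
        and not any(c.start <= m.start and m.end <= c.end for c in categories)
    ]
    # Among intersecting or nested modifiers, keep only the widest span.
    kept = []
    for m in sorted(modifiers, key=lambda m: (m.start - m.end, m.start)):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    # Stitch the remaining (effective) key words by their location.
    effective = sorted(categories + kept, key=lambda m: m.start)
    if len(effective) < 2:
        return None
    return " ".join(m.text for m in effective)


title = "2023 new all-over print shirt dress"
matches = [
    locate(title, "all-over print", "modifier"),
    locate(title, "print", "modifier"),        # nested in "all-over print"
    locate(title, "print shirt", "modifier"),  # intersects "all-over print"
    locate(title, "shirt", "modifier"),        # inside the category word
    locate(title, "shirt dress", "category"),
]
print(sieve_and_stitch(matches))
```

Sorting modifiers by descending span length before the overlap check handles both the containment rule and the intersection rule with a single greedy pass.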
[0074] It is understandable that, according to the word count of the
merchandise short-title,
modifier key words and category key words satisfying preset criteria can be
found and
then they can be stitched together according to their locational sequence, so
as to form a
fluent merchandise short-title. The described embodiment is for explaining how
to
generate a merchandise short-title from original merchandise title data. If
there are
different original merchandise title data, the foregoing process may be
repeated as many
times as required, thereby facilitating batch generation of merchandise short-
titles.
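Selecting key words to fit a target word count can be sketched as follows. The "preset criteria" are not specified in the text, so the rule used here (keep category words first, then add modifier words in title order until the limit is reached) is purely an assumption, as are the names `fit_to_word_count` and `max_words`.

```python
def fit_to_word_count(effective, max_words):
    """Pick at most max_words key words and stitch them by position.

    effective: list of (position, text, tag) tuples for the effective key words.
    """
    categories = [e for e in effective if e[2] == "category"]
    modifiers = [e for e in effective if e[2] == "modifier"]
    # Assumed criterion: category words take priority over modifier words.
    chosen = categories[:max_words]
    for modifier in modifiers:
        if len(chosen) >= max_words:
            break
        chosen.append(modifier)
    # Restore the locational sequence of the original title.
    chosen.sort(key=lambda e: e[0])
    return " ".join(text for _, text, _ in chosen)


effective = [
    (0, "korean-cutting", "modifier"),
    (1, "all-over print", "modifier"),
    (2, "chiffon", "modifier"),
    (3, "dress", "category"),
]
print(fit_to_word_count(effective, max_words=3))
```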
[0075] Embodiment 2
[0076] The present embodiment provides an apparatus for generating merchandise
short-titles,
comprising:
[0077] a data collecting unit, for crawling merchandise title data and/or
collecting search term
data, so as to construct a corpus data set;
[0078] a word library unit, for categorizing corpuses in the corpus data set by merchandise
categories based on a merchandise category table, and then extracting key words to construct
a word library;
[0079] a word tagging unit, for tagging each key word in the word library as
either a modifier
word or a category word according to a part of speech of the word;
[0080] a word matching unit, for performing word segmentation on the original
merchandise title
data so as to obtain plural title words, matching each of the title words with
the key words
in the word library, respectively, and outputting the key words that have
matches; and
[0081] a processing unit, for sieving out at least two effective key words
from the plural key
words, and stitching the effective key words into the merchandise short-title
according to
their parts of speech.
[0082] As compared to the prior art, the disclosed apparatus for generating merchandise
short-titles provides beneficial effects that are similar to those provided by the disclosed
method for generating merchandise short-titles as enumerated above, and thus no
repetitions are made herein.
[0083] Embodiment 3
[0084] The present embodiment provides a computer-readable storage medium, in
which a
computer program is stored. When run by a processor, the computer program
executes
the steps of the method for generating merchandise short-titles as described
previously.
[0085] As compared to the prior art, the disclosed computer-readable storage medium provides
beneficial effects that are similar to those provided by the disclosed method for
generating merchandise short-titles as enumerated above, and thus no repetitions are
made herein.
[0086] As will be appreciated by people of ordinary skill in the art,
implementation of all or a
part of the steps of the method of the present invention as described
previously may be
realized by having a program instruct related hardware components. The program
may
be stored in a computer-readable storage medium, and, when executed, the program performs
the individual steps of the methods described in the foregoing embodiments.
The storage
medium may be a ROM/RAM, a hard drive, an optical disk, a memory card or the
like.
[0087] The present invention has been described with reference to the
preferred embodiments
and it is understood that the embodiments are not intended to limit the scope
of the present
invention. Moreover, as the contents disclosed herein should be readily
understood and
can be implemented by a person skilled in the art, all equivalent changes or
modifications
which do not depart from the concept of the present invention should be
encompassed by
the appended claims. Hence, the scope of the present invention shall only be
defined by
the appended claims.