Accept Duplicate Entry Exception for Performance Benefits?

Question

I am currently programming a news API. To fetch news I am using java to parse XML from a List of RSS Feeds (URLs) and write them to a mysql database. I am doing this at a regular interval i.e. every 5 minutes. As these news feeds often are identical or similar to the prior time fetching I currently get

Accepted Answer

Can you really tell if two documents are duplicates?  For example, I have seen two identical articles with different headlines.So, assuming you can say what part(s) need to be checked to dup, make a UNIQUE index in the table containing the news article.But, there is a problem &#8212; UNIQUE is limited in size.  In particular, the text of any article is likely to exceed that limit.So&#8230;  Take a &#8220;hash&#8221; or &#8220;digest&#8221; of the string.  Put that in the unique column.  Then, when you try to insert the same article again, you will get an error.Well, the &#8220;error&#8221; can be avoided by saying INSERT IGNORE ....A simple, and adequate, hash for this task is the function MD5() &#8212; available in SQL and most application languages.  It generates a constant length string that is almost guaranteed to be as unique as the text it is &#8216;digesting&#8217;.

Advertisement

Answer