View Issue Details

IDProjectCategoryView StatusLast Update
0013187phpList 3 plugins[All Projects] Generalpublic06-02-19 11:35
Reporterflohack 
PrioritynormalSeveritymajorReproducibilityalways
Status newResolutionopen 
Summary0013187: RSS Import: getrss truncates links, then when checking for new items oversized items get imported again
DescriptionUsing phplist and serendipity for feeding...

When getrss retrieves the feed items, the SQL statement checks if title, link and listnumber are equal. It then imports the item in the database. The title field is only 100 characters long, which leads to truncation in the actual case where the title is very long or the rss feed component creates long links.
the next scheduled import picks up the truncated title in the database and compares it to the original - the compare does not match and the item gets imported again, and again, and again...

FIX:
change table rssitem field link to varchar(254) or text
Additional InformationIt was a very hard task to find this one! we examined the database and the code so many times. The titles are not truncated in the mails and rss db views, beacuase they are generated from rssitem_data, where every attribute is a text field. finally a link, count(link) select-statement on the rssitem table got me closer, but it was first revealed when I pressed on the "show me full columns" button in phpmyadmin. The truncation vs. the count of duplicates then proofed the case.
TagsNo tags attached.

Activities

flohack

03-02-08 18:07

reporter   ~0040340

as I found out today, also the field title is affected (yes we have news items with titles longer than 100 chars :)). Therefore it also should be changed to text or expanded to 255 chars...

funkypenguin

11-03-08 05:16

reporter   ~0042398

For an immediate solution, I patched getrss.php on line #102 from:

Sql_Query(sprintf('select * from %s where title = "%s" and link = "%s"', $tables["rssitem"], addslashes($item["title"]), addslashes($item["link"])));

To:

Sql_Query(sprintf('select * from %s where title = "%s" and link = "%s"', $tables["rssitem"], substr(addslashes($item["title"]),0,100), substr(addslashes($item["link"]),0,100)));

Essentially, this just compares the first 100 characters of the title and link field, eliminating the problem. While increasing the database field size to 255 chars will fix the problem most of the time, it's possible that somebody will STILL have URLs / titles longer than 255 characters. Whatever the decision re length of DB field is, I propose that the above "substr" chop be done to prevent accidental RSS item duplication ;)

See my full post here : http://www.funkypenguin.info/howto/how-use-phplists-rss-features-when-feed-urls-exceed-100-characters

(and that SEO-focused URL is the reason WHY we need to address this ;)

- David

11-03-08 05:17

 

phplist_rss_long_urls_patch_0.diff (848 bytes)
--- phplist_orig/admin/getrss.php	2007-08-28 01:01:27.000000000 +1200
+++ phplist/admin/getrss.php	2008-03-11 12:23:16.000000000 +1300
@@ -99,7 +99,8 @@
 			else
 				ProcessError($GLOBALS['I18N']->get('Process Killed by other process'));
 			$itemcount++;
-			Sql_Query(sprintf('select * from %s where title = "%s" and link = "%s"', $tables["rssitem"], addslashes($item["title"]), addslashes($item["link"])));
+			// Funky Penguin Mod - chop the title and the URL at 100 characters, since that's all the database will hold. SEO = long urls!
+			Sql_Query(sprintf('select * from %s where title = "%s" and link = "%s"', $tables["rssitem"], substr(addslashes($item["title"]),0,100), substr(addslashes($item["link"]),0,100)));
 			if (!Sql_Affected_Rows()) {
 				$newitemcount++;
 				Sql_Query(sprintf('insert into %s (title,link,source,list,added)

flohack

17-03-08 16:45

reporter   ~0042978

David,

thanks so far, just a remark, I think that the chance that URLs that are longer than 100 chars are unique on the very first 100 is rather high, so I would prefer an alternative to your patch, although this one will fix most of the user who experience issues with that.

FLoh

funkypenguin

17-03-08 19:24

reporter   ~0042987

Hi FLoh,

I'm not so sure, I can see it happening with certain URL conventions, where the "uniqueness" is at the end of the URL, for example:

http://my-favourite-tv-shows.com/episode-guide/comedy-and-drama/friends/season-three/the-one-with-the-extraordinarily-long-urls/page1.htm

http://my-favourite-tv-shows.com/episode-guide/comedy-and-drama/friends/season-three/the-one-with-the-extraordinarily-long-urls/page2.htm

etc...

flohack

09-04-08 10:29

reporter   ~0044443

Hi David,
what did you mean with not so sure, your example is very unique on the first 100 and therefore will suffer from the truncation and only the first entry will be imported.... that´s what I meant, 100 chars may not be enough fo form a unique key, better extend the DB fields *and* truncate to 255 or so...

FLoh

kerberos

10-07-08 19:52

reporter   ~0050096

I may be wrong, but wouldn't you use the values 0,99 in the substr function to get the 1st 100 characters?

gazouteast

02-10-10 17:18

reporter   ~0051107

'

Surely the query should be using two values minimum - such as the URL and the publish date/time - that would increase the uniqueness exponentially.

.

chrisa

23-05-11 14:57

reporter   ~0051359

Same problem here. Also seem to have problem with non-standard characters in the title - maybe unicode characters are cut off improperly? (cannot reproduce it at the moment)
Wouldn't it be better to compare via a checksum (md5) from the full url that is saved in a separate field? So you needn't bother about the length.

apug

05-09-11 08:29

reporter   ~0051399

@chrisa: I've the same problem with non standard characters, I've solved it changing line #108
from
$tables["rssitem"],addslashes($item["title"]),addslashes($item['link']),addslashes($feed[0]),$feed[1]));

to
$tables["rssitem"],addslashes(substr($item["title"],0,100)),addslashes(substr($item["link"],0,100)), addslashes($feed[0]),$feed[1]));

Issue History

Date Modified Username Field Change
31-01-08 17:24 flohack New Issue
03-02-08 18:07 flohack Note Added: 0040340
11-03-08 05:16 funkypenguin Note Added: 0042398
11-03-08 05:17 funkypenguin File Added: phplist_rss_long_urls_patch_0.diff
17-03-08 16:37 user1822 Target Version => 2.10.6
17-03-08 16:45 flohack Note Added: 0042978
17-03-08 19:24 funkypenguin Note Added: 0042987
09-04-08 10:29 flohack Note Added: 0044443
10-07-08 19:52 kerberos Note Added: 0050096
20-04-10 03:26 h2b2 Target Version 2.10.7 => 2.11.4
02-10-10 17:18 gazouteast Note Added: 0051107
23-05-11 14:57 chrisa Note Added: 0051359
05-09-11 08:29 apug Note Added: 0051399
23-05-12 01:00 michiel Project phpList 3 application => rssmanager
23-05-12 01:00 michiel Category RSS => General
06-02-19 11:35 erion Project rssmanager => phpList 3 plugins