View Issue Details

ID: 0018864
Project: phplist application
Category: Statistics
View Status: public
Last Update: 28-09-17 07:51
Reporter: michiel
Priority: normal
Severity: minor
Reproducibility: random
Status: new
Resolution: open
Product Version:
Target Version:
Fixed in Version:
Summary: 0018864: filter duplicated or automated "clicks" from the statistics
Description:
It seems that at times some systems cause automated "clicks", which then inflate the click statistics. It would be good to investigate, and to add an option to suppress these clicks when viewing the click statistics.
Tags: No tags attached.

Activities

michiel

24-09-17 20:34

manager   ~0059411

I just posted a link in a teams page, and it was immediately pre-fetched
and displayed. This is now a common thing to do in discussion systems. I
have a hunch someone is auto-posting the phpList mails on a system that
then goes off and fetches the content to display a summary in the chat room.

We need to prevent this from happening. @duncan can you check if these systems
obey the robots.txt file, or simply fetch it?

I do think I've added headers to the pages to block robots, but that
won't help if the requests come from elsewhere.
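
For reference, a minimal sketch of the kind of robot-blocking signals michiel refers to; this is illustrative, not the actual phpList code, and as noted above it only helps against crawlers that honour these hints:

// Sketch only; not the actual phpList code referred to above.
// Well-behaved crawlers honour these signals; chat-system link
// prefetchers that present themselves as browsers often do not.
header('X-Robots-Tag: noindex, nofollow, noarchive');
echo '<meta name="robots" content="noindex, nofollow" />';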

solmar

26-09-17 09:43

reporter   ~0059416

Duncan on the phpList dev mailing list, 18/09/2017 at 11:26:

Most subscribers are technical, engineers etc., so the emails are going
mostly to corporate subscribers, who I guess are more likely to have
automated checking of incoming emails.

I had hoped that it would be possible to discard clicks based on the IP
address or user agent, but the logging that I added doesn't show any
particular patterns.

This is a problem for the client because he sells banner ads in the
emails, the pricing of which is based on the expected number of clicks.
With this problem the phplist click statistics are going to be way too
high.

thanks

Duncan

solmar

26-09-17 09:45

reporter   ~0059417

Alessandro on the phpList dev mailing list, 19/09/17 at 10:35:

Hello Duncan,

I too noticed inconsistencies between phpList statistics and the supposed
reality. You have provided an explanation for one: users who seem to
click on everything, but who, when contacted directly, are not interested.

You mentioned that all the strange clicks happen within a few seconds. The
data that you forwarded shows the same user-id and message-id.

I wonder if clicks could be filtered/discarded based on the assumption
that:
- A piece of software will check all links within a matter of seconds.
- A human being will take "tens of seconds" if actually reading the
pages (if not, these clicks are worthless anyway).

phpList could have a switch to activate click filtering, and a
parameter to set the "minimum timespan between clicks".

When a unique combination of user-id and message-id is observed making
multiple clicks in less than that timespan, it is assumed to be
software and the clicks are discarded.

The filtering could run on cron, so statistics would be inflated only
until the next cron run. If cron is run often, statistics would be
virtually correct when observed.

Best,

Alessandro
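
For illustration, a minimal standalone sketch of the timespan-based filtering Alessandro describes, written to run from cron. The table and column names (click_log, userid, messageid, linkid, clicked as a Unix timestamp, automated) and the PDO connection details are assumptions for the example, not phpList's real schema:

<?php
/*
 * Sketch only, not phpList code: cron job that flags clicks as automated
 * when one user-id/message-id combination clicked more than one distinct
 * link within $minTimespan seconds (requiring distinct links avoids
 * flagging a simple double click). All table/column names are illustrative.
 */
$minTimespan = 20; // "minimum timespan between clicks", in seconds

$db = new PDO('mysql:host=localhost;dbname=phplist', 'user', 'pass');

// Find user-id/message-id combinations whose clicks on several distinct
// links all happened within the timespan.
$suspect = $db->prepare(
    'SELECT userid, messageid
       FROM click_log
      GROUP BY userid, messageid
     HAVING COUNT(DISTINCT linkid) > 1
        AND MAX(clicked) - MIN(clicked) < :timespan'
);
$suspect->execute([':timespan' => $minTimespan]);

$flag = $db->prepare(
    'UPDATE click_log SET automated = 1 WHERE userid = :u AND messageid = :m'
);
foreach ($suspect->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // Flag rather than delete, so the raw data is kept and the statistics
    // pages only need to exclude rows where automated = 1.
    $flag->execute([':u' => $row['userid'], ':m' => $row['messageid']]);
}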

duncanc

26-09-17 18:05

developer   ~0059422

The robots.txt file is being fetched mostly by real bots that identify themselves as such, e.g. Googlebot, bingbot. There are only a few fetches with the user agent looking like a browser, and those IP addresses do not look suspicious.

To explain what I did a week or so ago, I added code to lt.php:
1) When the page is accessed without an HTTP referrer, return a page that redirects to itself using JavaScript.
2) When the page is accessed with an HTTP referrer that matches the page, i.e. it has been redirected, treat that as a genuine access.

/*
 * Add another level of redirecting to try to record only real clicks
 */

if (!(isset($_SERVER['HTTP_REFERER']) && false !== strpos($_SERVER['HTTP_REFERER'], $_SERVER['REQUEST_URI']))) {
    $requestUri = $_SERVER['REQUEST_URI'];
    // No matching referrer yet: serve a minimal page whose only job is to
    // redirect back to this same URL, so that the second request carries a
    // referrer containing the request URI.
    echo <<<END
<html>
<body>
<p>If you are not redirected automatically, follow this <a href='$requestUri'>link</a>.</p>
<script type="text/javascript">
window.location.href = "$requestUri";
</script>
</body>
</html>
END;
    exit;
}

This had mixed results. A few of the worst repeated clickers appeared not to follow the redirection and had no clicks at all. But many had just a few fewer than before, such as going from 6 to 4 clicks on a link.
An adverse effect was that some requesters seemed to just repeatedly request the page, several times a second, before giving up, so the number of page accesses was a lot higher.

Because the results were not too helpful I have now removed that code.

samtuke

27-09-17 06:38

administrator   ~0059434

@duncanc that's great research. Would JS checks more rigorous than just a redirect not block more of the bot traffic? A webserver module could also be used to limit connections, e.g. per minute, to prevent the repeated requests.
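
To illustrate the rate-limiting idea, here is a sketch done at the application level in PHP rather than as a webserver module; the use of APCu, the limit of 10 hits per minute and the placement at the top of lt.php are all assumptions for the example:

/*
 * Sketch only: per-IP limit on repeated lt.php requests, as an
 * application-level stand-in for a webserver module. Requires the APCu
 * extension; the limit and key naming are illustrative.
 */
$maxPerMinute = 10;

if (function_exists('apcu_inc')) {
    // One counter per IP address per minute, expiring shortly afterwards.
    $key = 'lt_hits_' . $_SERVER['REMOTE_ADDR'] . '_' . floor(time() / 60);
    apcu_add($key, 0, 120);
    $hits = apcu_inc($key);
    if ($hits !== false && $hits > $maxPerMinute) {
        header('HTTP/1.1 429 Too Many Requests');
        header('Retry-After: 60');
        exit;
    }
}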

duncanc

28-09-17 05:24

developer   ~0059442

The point is that the automated clickers identify themselves as browsers in the user agent, not as bots. It might be worthwhile filtering genuine bots (google, bing, etc) out of the click statistics, but that is not the problem.
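
For the narrower point about genuine bots, a minimal sketch of the kind of user-agent check that could keep them out of the click statistics; the function name and pattern list are illustrative, not existing phpList code, and as noted above this will not catch clickers that spoof a browser user agent:

// Sketch only; not existing phpList code.
function isKnownBot($userAgent)
{
    // Crawlers that identify themselves; spoofed browser UAs slip through.
    return (bool) preg_match(
        '/googlebot|bingbot|slurp|duckduckbot|baiduspider|yandex/i',
        (string) $userAgent
    );
}

$isBot = isKnownBot(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '');
// When $isBot is true, redirect to the destination as usual but skip
// recording the click.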

samtuke

28-09-17 07:51

administrator   ~0059443

Yes, it sounds like checking the user agent and requiring a JS redirect is insufficient to stop bot clicks, so I propose that more sophisticated JS-based tests are used to validate the clicks. I'm not sure what is available in existing open source libraries.
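
As one possible shape for such a JS-based test, a sketch only, building on the earlier lt.php experiment: the intermediate page makes the browser execute JavaScript that sends back a server-generated token, and the click is only counted when the token round-trips, so clients that never run JavaScript (or that strip the referrer) are handled explicitly. The jsproof parameter name and the shared secret are invented for the example:

/*
 * Sketch of a stronger JavaScript check for lt.php. All names here are
 * illustrative; this is not existing phpList code.
 */
define('CLICK_PROOF_SECRET', 'change-me');   // illustrative shared secret

$uri  = $_SERVER['REQUEST_URI'];
// Strip any previously appended jsproof parameter so the token is always
// computed over the same base URL.
$base     = preg_replace('/[?&]jsproof=[0-9a-f]+$/', '', $uri);
$expected = substr(hash_hmac('sha256', $base, CLICK_PROOF_SECRET), 0, 16);

if (isset($_GET['jsproof']) && hash_equals($expected, (string) $_GET['jsproof'])) {
    // Token came back intact: count the click here, then redirect to the
    // real destination as lt.php normally would.
} else {
    $sep      = strpos($base, '?') === false ? '?' : '&';
    $target   = $base . $sep . 'jsproof=' . $expected;
    $href     = htmlspecialchars($target, ENT_QUOTES);
    $jsTarget = json_encode($target);
    echo <<<END
<html>
<body>
<p>If you are not redirected automatically, follow this <a href="$href">link</a>.</p>
<script type="text/javascript">
// Clients that never execute JavaScript stop here and are not counted.
window.location.href = $jsTarget;
</script>
</body>
</html>
END;
    exit;
}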