Bringing Fun Back To Referrer Logs
Referrer logs used to be one of the fun parts of the early web, but they have been ruined.
I decided to fix them.
The Old Ways
Quitting Twitter and social media has made my life more pleasant overall, but there are tradeoffs.
The feedback from an audience – which has been weaponized into an addictive feedback loop through an over-focus on “engagement” metrics – wasn’t always like this.
Feedback used to come from log files created by your web server and then some analysis of the quiet footprints left by readers. (Sure, there were comments and discussion boards and email and AIM, but logs were the common denominator.)
While IP addresses and the number of accesses (“hits”) are interesting, you can also see referrers – the URLs that people clicked to arrive at your site. This lets you see whether anyone was actually discussing what you wrote – the web’s precursor to at-mentions and similar things.
· · ·
I had turned off logging on my web servers a while ago because I decided that I didn’t care to know how tiny my audience was. Part of that was that referrer logs had become useless – instead of being filled with the URLs of pages linking to me, they were filled with spammers faking the referrer field.
Why were people spamming referrer logs? Some web servers and CMSes are configured to generate lists of referrers and publish them on the web. If those pages end up publicly accessible and crawled by a search engine, faking the referrer field becomes a way to generate “free” links back to a spammy site from a legitimate one, potentially boosting its search engine placement or helping generate traffic in other ways.
I decided to turn logging back on when I changed servers a couple of weeks ago, and the referrer log situation was even worse than I remembered.
Simple Text Filters vs. Complex Machine Learning Filtering Algorithms
Looking at this log full of garbage, I felt incredible sadness that something that used to bring me and early web pals joy had been destroyed so thoroughly by unintended effects of search engines and advertising.
I looked at some existing block lists and tried them out, but they weren’t particularly effective, since the spam URLs constantly change.
I considered using some sort of self-hosted JavaScript and log analysis to try to distinguish legitimate visits with a real referrer from spammy bots with fake referrers. Looking at the logs, this probably would have worked some of the time, but it’s an asymmetric fight – if you filter out “HEAD” requests, spammers just start making “GET” requests. If you filter out requests that don’t fetch the CSS, they just fetch the CSS.
It was clear from the logs that this was already happening. It’s a losing game of whack-a-mole.
So maybe train a machine learning algorithm on good vs. bad visits and distinguish spam IPs? Sounds needlessly complicated and also susceptible to the same issues.
Thinking asymmetrically – about what the spammers can’t do – made the solution seem obvious in retrospect. Spammers can’t include actual links to the million sites they spam. So just fetch the HTML of each referrer on my server and eliminate any that don’t include a link back to me.
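In shell terms, the core of that check is just curl piped into grep – a rough sketch of the idea (the real script below adds caching, a timeout, and a user agent):
# print the referrer only if the page it points at mentions $TARGET (sketch)
curl --silent "$REF" | grep -q "$TARGET" && echo "$REF"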
This sounded like a job for simple UNIX tools pipelined together on top of my logs.
A pointless job, probably, but here’s how I did it. I use NGINX on OpenBSD, but the basics should work on a Linux/*BSD machine and with other web servers with a little tweaking.
check_ref
I created a simple shell script to check whether the page at a referrer URL contains a link to me –
#!/usr/bin/env ksh
#
# USAGE: check_ref TARGET REFERRER
#
# fetches the HTML at REFERRER and checks that it includes TARGET
#
# if it does, output the REF
# if not, no output
#
# writes checked URLs to BAD_REFS and GOOD_REFS to avoid repeated downloads
#
TARGET=$1
REF=$2
BAD_REFS=/tmp/badrefs
GOOD_REFS=/tmp/goodrefs

# already known bad: say nothing and quit
if [[ -e $BAD_REFS ]]; then
    if grep -qF -- "$REF" "$BAD_REFS"; then
        exit 0
    fi
else
    touch "$BAD_REFS"
fi

# already known good: print it and quit
if [[ -e $GOOD_REFS ]]; then
    if grep -qF -- "$REF" "$GOOD_REFS"; then
        echo "$REF"
        exit 0
    fi
else
    touch "$GOOD_REFS"
fi

# otherwise fetch the referrer and look for a link back to TARGET
if curl --max-time 5 --user-agent "refbotcheck" --silent "$REF" | grep -qF -- "$TARGET"; then
    echo "$REF"
    echo "$REF" >> "$GOOD_REFS"
else
    echo "$REF" >> "$BAD_REFS"
fi
Then calling it looks something like –
$ check_ref trenchant.org http://adammathes.com
http://adammathes.com
$
$ check_ref trenchant.org http://example.com
$
This way I can feed a series of URLs through it (via xargs, as shown below) and filter out the junk that doesn’t actually link.
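For example, a file of candidate URLs – a hypothetical refs.txt with one URL per line – can be run through it with xargs –
$ cat refs.txt | xargs -n1 ~/bin/check_ref trenchant.org
http://adammathes.com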
good_refs
I then created a script that grabs the day’s referrers and pipes them through the check_ref script.
I’m using a combined access/referrer log format in nginx defined as –
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
The script I put together to filter down to the good referrers –
#!/usr/bin/env ksh
TARGET=trenchant.org
LOG=/var/www/logs/trenchant.log
cat "$LOG" \
| grep "$(date '+%d/%b/%Y')" \
| awk 'tolower($6) !~ /head/ && $11 ~ /http/ {print $11}' \
| tr -d '\"' \
| sort \
| grep -v "://$TARGET" \
| grep -v "://www.$TARGET" \
| xargs -n1 ~/bin/check_ref "$TARGET" \
| uniq -c
This incantation line by line –
1. Cat the log file
2. Filter to entries that include today's date in the format 05/Feb/2018.
3. Drop HEAD requests and entries without an http referrer, keeping just the referrer field
4. Remove double quotes from the output
5. Sort the referrers
6. Remove internal referrers
7. Remove more internal referrers
8. Run each referrer through check_ref, calling check_ref $TARGET $REF
9. Count the results (which we can do with uniq since we pre-sorted)
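On a good day the end result looks something like this (made-up counts and URLs) – uniq -c’s count next to each referrer that actually links back –
   2 http://adammathes.com
   1 http://weblog.example/2018/02/links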
I threw this in cron to run at the end of the day and now I get an email of any good referrers each morning.
Or I get an email full of errors explaining why it didn’t work. Or nothing, because nobody legitimately linked to me.
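For reference, the cron entry is nothing special – something like the following, with the path and schedule being assumptions about your setup (cron mails a job’s output to its owner by default) –
# check today's referrers just before midnight; cron emails any output
59 23 * * * $HOME/bin/good_refs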
Anyway, it’s a small bit of joy reclaimed from the early web.
Next: bringing the ease of social media style one-click feedback – sent straight to a phone – to the old web.
· · ·
If you enjoyed this post, send emoji to my phone
🐸
🎯
🍩