On August 21st and 22nd I had the opportunity to present some recent
work [1] at the Conference on Email and Anti-Spam [2], held at
Microsoft Research in Mountain View.  If you have more than a passing
interest in email, security and the like, my notes and observations
from the conference follow below.  

rob


Summary
-------
- Keynote from Lois Greisman of the FTC: I was previously unaware of
  how involved the FTC was in preventing spam and phishing as part of
  their consumer protection directive.  Between the 300,000 spam mails
  they collect each day and the 40,000 consumer complaints each week,
  the FTC has a very good view into the state of spam.  They use this
  data to investigate new scams as well as to determine the
  effectiveness of, and compliance with, the CAN-SPAM Act.

- Keynote from Bradley Taylor of Google: Small team of engineers
  reactively acts upon new spam challenges, largely manual process.
  Google faces problems of both inbound and outbound spam (using gmail
  to send spam).  DKIM and SPF very useful, reject all unsigned ebay
  or paypal emails.  Image spam is a solved problem.  Providers
  exchange abuse information using abuse reporting format (ARF)
  records.  Key problem now is in securing new account signup.

- Conference included a "live-spam" challenge where researchers were
  given a 72 hour live feed of spam to process.  Opportunity for
  researchers and industry to compete.

- Some research trends: Robustly handling noisy (i.e. incorrectly
  labeled) user feedback.  New forms of spam, including VoIP, blog
  and social networking spam.  Speech models for understanding
  communication context.  Increased focus on phishing.


August 21, 2008
---------------
*Keynote from Lois Greisman of the FTC: Consumer protection.  How is
unfairness applied in enforcement context?  Could consumers reasonably
avoid the scam?  How hard to get rid of unauthorized charge?  Even if
noticed, the non-economic injury is substantial.  

Identity theft, safety measures companies must employ.  FTC has used
unfairness authority -- assume an entity has weak security or hacked.
Is there substantial injury?  Isolated or systemic problems?  Were
there readily available and inexpensive measures that could have
prevented the breach?  Just as we never eliminate fraud, we will not
eliminate security breaches.  But need to have steps in place.
Example of how FTC uses its authority.

Many different issues, spam, fraud, etc.  How does FTC decide what to
work on?  Many complaints received from consumers.  About 40,000
contacts from consumers in any given week.  Rich source of data to
understand what is going on in the marketplace.  Also have a spam
database.  Receive about 300,000 spam messages every day.  Mine and
analyze the data.  

FTC has been active in spam since the 1990s.  First case brought
challenging spam was 1997.  Hold spam summits, publish reports,
effectiveness of regulation, etc.  Active on law enforcement front --
more than 100 cases filed.  Not a surprise, just another medium for
advertising.  Very cheap to move on-line, benefit of anonymity.
CAN-SPAM act, effective opt-out, can't falsify sender.  Prevent
sexually explicit spam.  Additional amendments in the context of
definition of sender.  Think of multiple senders, for instance an
airline plus hotel plus car rental.  Goal with these rules is to
provide means for business to abide by.  Clarification, streamlining,
facilitate ecommerce.  Also focus on consumer education.  

CAN-SPAM has leveled playing field for legitimate entities.  We're not
seeing the type of sexually explicit emails as before.  They're
abiding by the opt-outs.  Industry is abiding by the law.  Filters are
doing a good job.  But spam is increasingly used for financial crime.
The spam problem today is less about nuisance and volume and more
about dissemination of malware and phishing scams.  Email with
malicious code, worms, trojans, botnets.  Phishing code that scans
browser history and then generates an email from financial
institutions you use.  

50% of fortune companies authenticating email records.  In combination
with reputation tools, highly effective.  FTC has described spam as
one of the most intractable problems facing computer users.
Authentication is critical, reputation, buy-in from ISPs, developing
additional strategies.  50% penetration is great, but we should do
better.  On the law enforcement side, are there entities the FTC
should be pursuing?  We shouldn't limit our activities to traditional
spam.  For instance mobile marketing.  Mobile browsers and data
charges inhibit mobile use.  Text messaging, ring tones, ring backs
are phenomenal money makers.  What makes text messaging so successful?
Permits personalization.  Different types of efficiencies in that type
of communication.  Blocking 100-200 million text spams per month.
Suspect text spam is going to create an enormous set of challenges
both technical and education.  Different gateways, different legal
parameters that may or may not apply.  

FTC will continue to bring cases against spam.  Continue to promote
education, prevent consumers from being victimized, companies know how
to comply with regulations.  Let us know where we should shift R&D.
Want a close dialog with researchers. 


*Live Spam Challenge.  Research and commercial applications competing.
72 hour live email feed.  143k messages over 3 days.  Spam traps from
project honeypot.  Spam and ham traps from Gordon Cormack.  Donated
ham with headers rewritten to look live.  Winner: solido systems. 


*Reputation for sender identity.  With SPF, DKIM, 89+% of authenticated
senders are spammers.  Repuscore: collaborative reputation.
Collaborate to get a better view of domains on internet.  Reputations
computed on intervals.  Fraction of good mails.  Reported reputation
dampens old reputation.  
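
A minimal sketch of the interval-based update described above, in
Python.  The formula actually used in the talk is not recorded in
these notes; the weighted blend, the `damping` parameter and the
function name are assumptions for illustration only.

```python
def update_reputation(old_reputation, good, total, damping=0.5):
    """Blend the previous reputation with the newest interval's report.

    good / total is the fraction of good mails seen in the latest
    interval.  damping is a hypothetical weight in [0, 1]: the higher
    it is, the more the newly reported value dampens the old one.
    """
    if total == 0:
        # No mail observed in this interval: keep the old reputation.
        return old_reputation
    reported = good / total
    return (1.0 - damping) * old_reputation + damping * reported


# A domain with a clean history that starts sending mostly bad mail.
reputation = 1.0
for good, total in [(90, 100), (20, 100), (5, 100)]:
    reputation = update_reputation(reputation, good, total)
    print(round(reputation, 4))
```

With a blend like this, a domain cannot coast on an old good
reputation for long, but a single bad interval does not erase it
either.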


*Steve Webb, GaTech: Social networking fake profiles, advertising
viagra, etc.  Create Myspace honeypot, fake profiles, one for each
state.  Correlation with holidays.  Bot that is operating the fake
profiles, takes friend requests, downloads their data, doesn't become
friends.  Geographic variation of spam friend requests, Omaha most
popular, large fraction directed at Midwestern states.  Only
hypotheses of why.  Friend requests reportedly come predominantly from
California, no validation as to true location.  No correlation between
origin and destination.  15% of spam profiles have same HTML as
remaining 85%.  How often do friend requests get sent?  Do we see
repeat friend request "offenders?"  65 profiles sent to more than one
of our honeypots.  97% of duplicates sent within 4 minutes of each
other.  Following HTTP links and redirects, about 10% are unique.
After doing shingling analysis, only 6 distinct sites.  
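
For readers unfamiliar with shingling, here is a rough Python sketch
of how landing pages could be grouped into near-duplicate clusters.
It is not the authors' code: the word-level shingles, the Jaccard
threshold, the greedy clustering and the sample pages are all
assumptions chosen to keep the example short.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.5, k=4):
    """Greedily group pages whose shingle sets are similar enough.

    Each page joins the first cluster whose representative it
    resembles; otherwise it starts a new cluster.  Returns a list of
    lists of page indices.
    """
    clusters = []  # list of (representative shingles, member indices)
    for index, page in enumerate(pages):
        page_shingles = shingles(page, k)
        for representative, members in clusters:
            if jaccard(page_shingles, representative) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((page_shingles, [index]))
    return [members for _, members in clusters]


# Hypothetical page texts, not data from the talk.
pages = [
    "cheap pills online order now with free shipping today",
    "cheap pills online order now with free shipping tonight",
    "meet singles in your area tonight join free now",
]
print(cluster_pages(pages))
```

The first two pages differ by one word and land in the same cluster,
which is how many distinct URLs can collapse to a handful of sites.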


*Abusive comments in blogs, blog spam: measuring rate of flagging
on most popular rediff.com blog.  Motivation for abusive comments seems
to be personal or social.  For instance, large percentage of
obscenities and insults.  12% commercial.  Many readers per message.
Human annotator agreement in creating a gold standard data corpus is
low, only about 75%.  Inherently hard problem, similar to gray mail. 


August 22, 2008
---------------
*Keynote from Bradley Taylor, Google's gmail: Use of gmail as a file
server, 7GB, causes problems with folks sharing files, mp3s.  Spammers
create lots of accounts and send mail through them to get around
individual account sending limits.  Account hijacking through
phishing.  Also mailbombing problems, targeting a single person.  Spam
is easier to detect because it's directed.  What is spam?  Spam is
whatever our users say is spam, if they click to indicate a message is
spam.  Minimize the number of messages they click as spam.  Spammers
have more machines than google.  

Privacy -- can't manually view email without user's permission.  Even
if they could, doesn't scale.  Algorithmically determine good users
and automatically select their mail for training (some users don't
mark spam as spam).  Manual view helps for debugging.  Do ask users if
google can view the email.  Spam reporting rate of less than 1%.  Only
a few major campaigns per day.  Monitor and act quickly.  Manual task,
but automate where possible.  Rejecting spam is the best message to
send spammers.  Putting emails in the spam folder does not discourage
spammers -- users that want viagra will look in their spam folder to
get it.  Just reject it.

IP addresses are bad: lots of mail is forwarded to gmail (often for
spam filtering).  Webmail, hard to determine where the actual mail
came from.  New IPs, newly infected machines, sometimes legitimate.
DHCP effects.  

Using authentication (SPF, DKIM) for anti-phishing.  Reject unsigned
or unauthenticated ebay or paypal mails.  Yahoo does this also.
Authentication alone isn't enough, also have to do reputation.  Remove
all open HTTP redirectors.  
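
The reject rule for heavily phished domains might look roughly like
the Python sketch below.  It assumes SPF and DKIM have already been
evaluated upstream and are passed in as result strings; the domain
list, the function names and the choice to accept a pass from either
mechanism are illustrative assumptions, not Gmail's actual policy.

```python
# Assumed list of heavily phished domains, for illustration only.
PROTECTED_DOMAINS = {"ebay.com", "paypal.com"}


def is_protected(domain):
    """True if domain is a protected domain or one of its subdomains."""
    domain = domain.lower().rstrip(".")
    return any(
        domain == protected or domain.endswith("." + protected)
        for protected in PROTECTED_DOMAINS
    )


def should_reject(from_domain, spf_result, dkim_result):
    """Reject mail claiming a protected domain unless it authenticated.

    spf_result and dkim_result are result strings such as "pass",
    "fail" or "none", computed by an upstream checker (assumed here).
    Mail from other domains is left to the normal spam filter.
    """
    if not is_protected(from_domain):
        return False
    return spf_result != "pass" and dkim_result != "pass"


print(should_reject("paypal.com", "none", "none"))
print(should_reject("mail.ebay.com", "pass", "none"))
print(should_reject("example.org", "fail", "fail"))
```

Note that a pass only proves the mail came from the claimed domain,
which is why the talk stressed that reputation is still needed on top.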

Backscatter spam, joe-jobbing accounts.  Gmail's solution on inbound
side is to consider generating bounce messages as evil.  Webmail spam:
growing problem.  As port 25 becomes more difficult to use, port 80
becomes more attractive because it can't be blocked easily.  Outbound
spam different problem, false positives harder to handle.  Controlling
signups is critical.  CAPTCHAs are becoming ineffective, cheap labor
accounts for most of the problem.  Increasing ability of spammers to
break them algorithmically.  Before open signup, spammy accounts were
rare.  Problem much more severe now with open signup.  Audio CAPTCHAs
are easier to crack, good area for research.  How to determine if
it's paid labor
that's answering CAPTCHAs.  Feedback loops.  Abuse reporting format
(ARF) sent back to domain when messages marked as spam.  Exchanged
with other providers.  Also look at bounces.  Hijacking as a future
spam vector.  
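
To make the ARF feedback loop concrete, here is a small Python sketch
that builds such a report with the standard library `email` package.
It follows the layout later standardized in RFC 5965 (a
multipart/report with a human-readable part, a message/feedback-report
part and the original message).  The addresses, the User-Agent string
and the function name are made up for the example.

```python
from email import message_from_string
from email.mime.base import MIMEBase
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_arf_report(original, reporter, abuse_contact):
    """Wrap a message a user marked as spam in an ARF abuse report."""
    report = MIMEMultipart("report", report_type="feedback-report")
    report["From"] = reporter
    report["To"] = abuse_contact
    report["Subject"] = "Abuse report"

    # Part 1: human-readable explanation.
    report.attach(MIMEText("A user marked the attached message as spam.\n"))

    # Part 2: machine-readable fields (the three required ones).
    feedback = MIMEBase("message", "feedback-report")
    feedback.set_payload(
        "Feedback-Type: abuse\n"
        "User-Agent: ExampleFeedbackLoop/0.1\n"  # hypothetical name
        "Version: 1\n"
    )
    report.attach(feedback)

    # Part 3: the original message, headers included.
    report.attach(MIMEMessage(original))
    return report


# Hypothetical spam message and addresses.
spam = message_from_string(
    "From: spammer@example.net\n"
    "To: victim@example.com\n"
    "Subject: cheap pills\n"
    "\n"
    "Buy now.\n"
)
report = build_arf_report(spam, "fbl@example.com", "abuse@example.net")
print(report.as_string())
```

Because the second part is machine-readable, the receiving provider
can process reports automatically instead of reading each complaint.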


----
[1] http://rbeverly.net/research/papers/spamflow-ceas08.html 
[2] http://www.ceas.cc