On August 21st and 22nd I had the opportunity to present some recent work [1] at the Conference on Email and Anti-Spam [2], held at Microsoft Research in Mountain View. If you have more than a passing interest in email, security and the like, my notes and observations from the conference follow below.

rob

Summary
-------
- Keynote from Lois Greisman of the FTC: I was previously unaware of how involved the FTC is in preventing spam and phishing as part of its consumer protection directive. Between the 300,000 spam mails they collect each day and the 40,000 consumer complaints each week, the FTC has a very good view into the state of spam. They use this data to investigate new scams as well as to determine the effectiveness of, and compliance with, the CAN-SPAM Act.
- Keynote from Bradley Taylor of Google: A small team of engineers reacts to new spam challenges; the process is largely manual. Google faces both inbound and outbound spam (using Gmail to send spam). DKIM and SPF very useful, reject all unsigned eBay or PayPal emails. Image spam war: a solved problem. Providers exchange abuse information using abuse reporting format (ARF) records. Key problem now is securing new account signup.
- Conference included a "live-spam" challenge where researchers were given a 72 hour live feed of spam to process. Opportunity for researchers and industry to compete.
- Some research trends: Robustly handling noisy (i.e. incorrectly labeled) user feedback. New forms of spam including VoIP, blog and social networking spam; speech models for understanding communication context. Increased focus on phishing.

August 21, 2008
---------------
*Keynote from Lois Greisman of the FTC: Consumer protection. How is unfairness applied in an enforcement context? Could consumers reasonably avoid the scam? How hard is it to get rid of an unauthorized charge? Even if noticed, the non-economic injury is substantial. Identity theft, safety measures companies must employ.
FTC has used unfairness authority -- assume an entity has weak security or was hacked. Is there substantial injury? Isolated or systemic problems? Were there readily available and inexpensive measures that could have prevented the breach? Just as we never eliminate fraud, we will not eliminate security breaches. But need to have steps in place. Example of how FTC uses its authority.

Many different issues, spam, fraud, etc. How does FTC decide what to work on? Many complaints received from consumers. About 40,000 contacts from consumers in any given week. Rich source of data to understand what is going on in the marketplace. Also have a spam database. Receive about 300,000 spam messages every day. Mine and analyze the data.

FTC has been active in spam since the 1990s. First case brought challenging spam was in 1997. Hold spam summits, publish reports on the effectiveness of regulation, etc. Active on law enforcement front -- more than 100 cases filed. Not a surprise, just another medium for advertising. Very cheap to move on-line, benefit of anonymity.

CAN-SPAM Act: effective opt-out, can't falsify sender. Prevent sexually explicit spam. Additional amendments in the context of the definition of sender. Think of multiple senders, for instance an airline plus hotel plus car rental. Goal with these rules is to provide means for business to abide by. Clarification, streamlining, facilitate ecommerce. Also focus on consumer education.

CAN-SPAM has leveled the playing field for legitimate entities. We're not seeing the type of sexually explicit emails as before. They're abiding by the opt-outs. Industry is abiding by the law. Filters are doing a good job. But spam is increasingly used for financial crime. The spam problem today is less about nuisance and volume than about dissemination of malware and phishing scams. Email with malicious code, worms, trojans, botnets. Phishing code that scans browser history and then generates an email from financial institutions you use.
50% of Fortune companies authenticating email records. In combination with reputation tools, highly effective. FTC has described spam as one of the most intractable problems facing computer users. Authentication is critical, reputation, buy-in from ISPs, developing additional strategies. 50% penetration is great, but we should do better.

On the law enforcement side, are there entities the FTC should be pursuing? We shouldn't limit our activities to traditional spam. For instance mobile marketing. Mobile browsers and data charges inhibit mobile use. Text messaging, ring tones, ring backs are phenomenal money makers. What makes text messaging so successful? Permits personalization. Different types of efficiencies in that type of communication. Blocking 100-200 million text spams per month. Suspect text spam is going to create an enormous set of challenges, both technical and educational. Different gateways, different legal parameters that may or may not apply.

FTC will continue to bring cases against spam. Continue to promote education, prevent consumers from being victimized, make sure companies know how to comply with regulations. Let us know where we should shift R&D. Want a close dialog with researchers.

*Live Spam Challenge. Research and commercial applications competing. 72 hour live email feed. 143k messages over 3 days. Spam traps from project honeypot. Spam and ham traps from Gordon Cormack. Donated ham with headers rewritten to look live. Winner: solido systems.

*Reputation for sender identity. With SPF, DKIM, 89+ of authenticated senders are spammers. Repuscore: collaborative reputation. Collaborate to get a better view of domains on the internet. Reputations computed on intervals. Fraction of good mails. Reported reputation dampens old reputation.

*Steve Webb, GaTech: Social networking fake profiles, advertising viagra, etc. Create MySpace honeypot, fake profiles, one for each state. Correlation with holidays.
Bot that is operating the fake profiles takes friend requests, downloads their data, doesn't become friends. Geographic variation of spam friend requests, Omaha most popular, large fraction directed at Midwestern states. Only hypotheses as to why. Friend requests reportedly come predominantly from California, no validation as to true location. No correlation between origin and destination. 15% of spam profiles have same HTML as remaining 85%. How often do friend requests get sent? Do we see repeat friend request "offenders?" 65 profiles sent to more than one of our honeypots. 97% of duplicates sent within 4 minutes of each other. Following HTTP links and redirects, about 10% are unique. After doing shingling analysis, only 6 distinct sites.

*Abusive comments in blogs, blog spam: measuring rate of flagging on most popular rediff.com blog. Motivation for abusive comments seems to be personal or social. For instance, large percentage of obscenities and insults. 12% commercial. Many readers per message. Human annotator agreement in creating a gold standard data corpus is low, only about 75%. Inherently hard problem, similar to gray mail.

August 22, 2008
---------------
Bradley Taylor, Google's Gmail: Use of Gmail as a file server, 7GB, causes problems with folks sharing files, mp3s. Spammers create lots of accounts and send mail through them to get around individual account sending limits. Account hijacking through phishing. Also mailbombing problems, targeting a single person. Spam is easier to detect because it's directed.

What is spam? Spam is whatever our users say is spam, if they click to indicate a message is spam. Minimize the number of messages they click as spam. Spammers have more machines than Google.

Privacy -- can't manually view email without user's permission. Even if they could, doesn't scale. Algorithmically determine good users and automatically select their mail for training (some users don't mark spam as spam). Manual view helps for debugging.
Do ask users if Google can view the email. Spam reporting rate of less than 1%. Only a few major campaigns per day. Monitor and act quickly. Manual task, but automate where possible.

Rejecting spam is the best message to send spammers. Putting emails in the spam folder does not discourage spammers -- users that want viagra will look in their spam folder to get it. Just reject it.

IP addresses are bad: lots of mail is forwarded to Gmail (often for spam filtering). Webmail, hard to determine where the actual mail came from. New IPs, newly infected machines, sometimes legitimate. DHCP effects.

Using authentication (SPF, DKIM) for anti-phishing. Reject unsigned or unauthenticated eBay or PayPal mails. Yahoo does this also. Authentication alone isn't enough, also have to do reputation. Remove all open HTTP redirectors.

Backscatter spam, joe-jobbing accounts. Gmail's solution on the inbound side is to consider generating bounce messages as evil.

Webmail spam: growing problem. As port 25 becomes more difficult to use, port 80 becomes more attractive because it can't be blocked easily. Outbound spam is a different problem, false positives harder to handle.

Controlling signups is critical. CAPTCHAs are becoming ineffective, cheap labor accounts for most of the problem. Increasing ability of spammers to break them algorithmically. Before signups, spammy accounts were rare. Problem much more severe now with open signup. Audio CAPTCHA easier to crack, good area for research. How to determine if it's paid labor that's answering CAPTCHAs?

Feedback loops. Abuse reporting format (ARF) sent back to domain when messages marked as spam. Exchanged with other providers. Also look at bounces.

Hijacking as future spam vector.

----
[1] http://rbeverly.net/research/papers/spamflow-ceas08.html
[2] http://www.ceas.cc
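
A footnote on the "shingling analysis" mentioned in Steve Webb's talk above: the notes do not record how it was actually done, so what follows is only a minimal sketch of the general technique, assuming word-level shingles, Jaccard similarity and a greedy grouping pass. The function names, the shingle width k and the 0.5 threshold are all hypothetical choices for illustration, not details from the talk.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def count_distinct(pages, k=4, threshold=0.5):
    """Greedily group near-duplicate pages and return the number of groups.

    A page starts a new group only if it is not similar enough to the
    representative (first member) of any existing group. The threshold
    is an assumed value, not one reported at the conference.
    """
    representatives = []
    for page in pages:
        s = shingles(page, k)
        if not any(jaccard(s, r) >= threshold for r in representatives):
            representatives.append(s)
    return len(representatives)


# Toy example: two near-identical landing pages and one unrelated page.
a = "buy cheap pills online today from our trusted pharmacy store now"
b = "buy cheap pills online today from our trusted pharmacy store here"
c = "meet singles in your area tonight click here to join free"
print(count_distinct([a, b, c], k=3))
```

In this toy example the first two pages differ only in their last word, so they fall into one group and the unrelated third page forms its own. A pass of this kind is one way many superficially different landing pages could collapse to a handful of distinct sites.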