Approach

  • Gather ad images from popular & misinformative websites
    • Initialize a new browser session for each URL.
    • Get a HAR (HTTP Archive format) of the current webpage being crawled.
    • Gather URLs in webpage (src or href attribute).
    • If a URL matches an ad filter list (see uBlock for details), try to capture an image of the ad
    • Also record text content of the page if we ever need to do topical analysis
  • Determine a taxonomy: IAB or manual. Manually label each taxonomy token / label as normal or misinformative (e.g. breaks content policy, advocates for a political candidate, etc.)
  • Determine how CLIP works and its quota.
  • Submit each image to CLIP to get a label (see the sketch below).
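
If we use an open-source CLIP checkpoint (so no API quota applies), the labeling step could look like the minimal sketch below, assuming the Hugging Face transformers implementation of CLIP; the label strings are placeholders, not our final taxonomy:

# Zero-shot labeling of a captured ad image with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder taxonomy tokens; replace with the IAB or manual taxonomy.
labels = ["a clothing ad", "a political ad", "a health product ad", "a software ad"]

image = Image.open("ad.jpg")  # one of the captured ad screenshots
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-label similarity scores
probs = logits.softmax(dim=-1)
print(labels[probs.argmax().item()])  # best-matching taxonomy label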

Crawler Design

  • Filter list setup
    • We should use all filter lists to maximize (true) positives
    • Since the filter lists may also match trackers, we decide if the element / screenshot is an ad based on its resolution / area.
    • use gist to download filter lists
    • process filter lists (AdblockRules; see the parsing sketch under Resources)
  • Selenium setup
    • use this Chrome WebDriver, because some sites might try to block the crawler
    • see gist for chrome flags for selenium
      • window size not required
      • comment out --headless (otherwise there won’t be any GUI, which makes debugging hard, and some sites may even detect headless mode and block the crawler)
  • Websites list
    • Popular / legit websites list: Tranco List
      • better & dynamically updated website ranking
      • We may need to exclude sites like google (search engines), facebook, twitter (user-generated content platforms), etc
      • We need to manually go through the list and find 100 sites with random ads
    • tmz.com:
      • lots of random ads—use this site to test ad categorizing
      • could also be categorized as bad site
    • The CSV file contains around 200 bad sites. Find 100 bad sites with ads on the homepage (see the comment column of the CSV to figure out which ones don’t have ads on the homepage).
  • Crawling Process
    • Prepare driver:
      • We need to reinitialize a browser session for each visit to observe contextual ads, not behavioral ads
      • driver.quit() / driver.close(), re-create the driver, then driver.get() (see the crawl sketch after this list)
    • Load content
      • Some pages do not load all ads at once. To be safe, wait 30-60 s for the initial load, scroll to the bottom to trigger all dynamically loaded content and ads, then wait additional time (e.g. 15 s).
      • Consider alternative scrolling approaches to make sure every ad loads: scroll partially, wait, and repeat until the bottom is reached; set a timeout for pages with infinite scroll (see the crawl sketch after this list).
    • Ad identification
      • An ad could be “sub_frame” (iframe), “script” (e.g. a script that inserts an image into the page), “image.” We are mainly interested in iframe and image elements.
      • For scripts, we still match script URLs with filter lists. Since scripts are not visual elements, we can only store the script URLs for now. We might be able to track which filters match images vs. scripts vs. iframes and come up with a more efficient way to categorize ad images.
      • For iframe and img, we simply match src / href with filter lists. See gist for how it’s done.
      • Matching against filter lists may be time consuming. How do browser plugins make it this fast? (See the ad-identification sketch after this list.)
    • Data extraction
      • Save HAR file: BrowserMob Proxy vs Chrome DevTools API (see the HAR sketch after this list)
      • For each potential ad image, decide if screenshot is valid ad (e.g. image is over a certain size). Store as file if yes.
      • Remember to extract the textual content of the web page (for potential topical analysis; see the text sketch after this list)
    • Data Organization
      • domain-name/<base64-path>_<timestamp>.har
      • domain-name/<base64-path>_<timestamp>_<img|iframe>_<ad-id>.jpg (ad image)
      • domain-name/<base64-path>_<timestamp>_text.txt (extracted text content)
      • domain-name/<base64-path>_<timestamp>_scripturls.txt (script URLs)
      • <base64-path> may be empty for homepages (see the naming sketch after this list).
  • Project Considerations
    • if time permits: topical analysis based on ad and specific article content (not home page)
    • example ad data on misinformative sites: 4000 ads, 2000 distinct advertisers, ~200 misinformative sites; good sites may have more ads
    • We need to keep the code scalable for crawling more sites (in case we decide to publish a paper).
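
A minimal sketch of the per-URL crawl loop described above, assuming Selenium 4 with Chrome; the wait times, window size, and scroll cap are placeholders to tune:

# Fresh browser session per URL (contextual, not behavioral, ads), then
# scroll-wait-repeat so lazily loaded ads have time to render.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def make_driver():
    opts = Options()
    # opts.add_argument("--headless")  # keep commented out: the GUI helps
    #                                  # debugging, and some sites detect headless
    opts.add_argument("--window-size=1920,1080")  # optional, per the notes
    return webdriver.Chrome(options=opts)

def crawl(url, initial_wait=30, scroll_wait=15, max_scrolls=20):
    driver = make_driver()  # reinitialized for every visit
    try:
        driver.get(url)
        time.sleep(initial_wait)  # let the initial page load settle
        last_height = driver.execute_script("return document.body.scrollHeight")
        for _ in range(max_scrolls):  # cap guards against infinite scroll
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(scroll_wait)  # wait for newly revealed ads to load
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break  # page stopped growing; all content is loaded
            last_height = new_height
        # ... ad identification and data extraction go here ...
    finally:
        driver.quit()  # discard the session so the next visit starts clean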
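
Next, a sketch of the ad-identification step, assuming the adblockparser library for AdblockRules; the two inline filters and the minimum-area threshold are made-up examples (a real run would compile the full lists under Resources):

# Match img / iframe src URLs against filter rules and screenshot elements
# that are large enough to be ads rather than tracking pixels.
from adblockparser import AdblockRules
from selenium.webdriver.common.by import By

rules = AdblockRules(["||doubleclick.net^", "/adserver/*"])  # example filters

MIN_AREA = 100 * 100  # hypothetical pixel-area threshold for "visible ad"

def extract_ads(driver, out_prefix):
    ad_id = 0
    # ABP option names: "image" for <img>, "subdocument" for iframes
    for tag, abp_option in (("img", "image"), ("iframe", "subdocument")):
        for el in driver.find_elements(By.TAG_NAME, tag):
            src = el.get_attribute("src") or ""
            if not src or not rules.should_block(src, {abp_option: True}):
                continue
            size = el.size  # {'width': ..., 'height': ...}
            if size["width"] * size["height"] < MIN_AREA:
                continue  # tiny element: likely a tracker, not an ad
            el.screenshot(f"{out_prefix}_{tag}_{ad_id}.png")  # saves a PNG
            ad_id += 1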
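
For the HAR question, this is roughly what the BrowserMob Proxy option looks like (the Chrome DevTools API is the alternative); the binary path is a placeholder:

# Capture a HAR of the page load by routing Chrome through BrowserMob Proxy.
import json
from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

server = Server("/path/to/browsermob-proxy")  # placeholder path to the binary
server.start()
proxy = server.create_proxy()

opts = Options()
opts.add_argument(f"--proxy-server={proxy.proxy}")  # e.g. localhost:8081
opts.add_argument("--ignore-certificate-errors")  # proxy re-signs HTTPS traffic
driver = webdriver.Chrome(options=opts)

proxy.new_har("page", options={"captureHeaders": True})
driver.get("https://www.tmz.com")  # the test site from the notes
with open("page.har", "w") as f:
    json.dump(proxy.har, f)  # proxy.har is a plain dict in HAR format

driver.quit()
server.stop()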
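
Text extraction is nearly a one-liner with Selenium, sketched here for completeness:

# Save the page's visible text for potential topical analysis later.
from selenium.webdriver.common.by import By

def save_page_text(driver, out_path):
    text = driver.find_element(By.TAG_NAME, "body").text  # rendered text only
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)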
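
Finally, a sketch of the naming scheme from Data Organization, using URL-safe base64 for the path component (empty for homepages):

# Build names like domain-name/<base64-path>_<timestamp>_<suffix>.
import base64
import time
from urllib.parse import urlparse

def out_name(url, suffix):
    parsed = urlparse(url)
    path = parsed.path.strip("/")
    b64 = base64.urlsafe_b64encode(path.encode()).decode() if path else ""
    return f"{parsed.netloc}/{b64}_{int(time.time())}_{suffix}"

out_name("https://www.tmz.com/", "text.txt")  # -> www.tmz.com/_<ts>_text.txt
out_name("https://www.tmz.com/2023/01/01/story/", "img_0.jpg")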

About Research

In real-life research, you typically have to:

  • Define the research problem
    • List research questions
      • What is the research trying to solve?
      • What might come out of it? Is it worth it?
    • We might not have the research questions at the beginning. Sometimes we only come up with the questions the research answers at the end.
  • Do a literature survey: research existing, related works after finalizing the research problem
    • Check if your work has been done.
    • Get inspired by new works.
    • Check out the “Related Works” section in each paper for good next papers to read.

Resources

Filter lists:

raw_lists = {
    'easylist': 'https://easylist.to/easylist/easylist.txt',
    'easyprivacy': 'https://easylist.to/easylist/easyprivacy.txt',
    'antiadblock': 'https://raw.github.com/reek/anti-adblock-killer/master/anti-adblock-killer-filters.txt',
    'blockzilla': 'https://raw.githubusercontent.com/annon79/Blockzilla/master/Blockzilla.txt',
    'fanboyannoyance': 'https://easylist.to/easylist/fanboy-annoyance.txt',
    'fanboysocial': 'https://easylist.to/easylist/fanboy-social.txt',
    'peterlowe': 'http://pgl.yoyo.org/adservers/serverlist.php?hostformat=adblockplus&mimetype=plaintext',
    'squid': 'http://www.squidblacklist.org/downloads/sbl-adblock.acl',
    'warning': 'https://easylist-downloads.adblockplus.org/antiadblockfilters.txt',
}
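
A sketch of downloading these lists and compiling them into the AdblockRules matcher mentioned under Crawler Design; passing use_re2=True (with the re2 package installed) speeds matching up considerably:

# Fetch every raw filter list above and compile one combined matcher.
import requests
from adblockparser import AdblockRules

lines = []
for name, url in raw_lists.items():
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    lines.extend(resp.text.splitlines())

rules = AdblockRules(lines)  # pass use_re2=True for much faster matching
print(rules.should_block("http://ads.example.com/banner.png", {"image": True}))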