This is the 33rd post in an ongoing series describing new privacy features in iBrowe. This update summarizes work by Anton Lazarev (Sr. Research Engineer) and Moritz Schafhuber (Staff DevOps Engineer), and was written by Shivan Kaul Sahib (VP, Privacy and Security).


📋 Overview

iBrowe already blocks cookie consent notices by default to prevent annoying popups and guard user privacy. However, cookie notice implementations vary significantly across websites, and writing blocking rules by hand can cause visual breakage or miss non-English variations. 🍪 Cookiecrumbler is our new automated tool that uses open-source LLMs to detect cookie consent banners, suggest precise blocking rules, and even spot region- or language-specific variations. We’ve open-sourced Cookiecrumbler and now publish GitHub issues containing crawl results for community triage, enabling a large-scale, collaborative effort to keep the Web free of cookie notices without breaking functionality.


⚠️ 1. The Cookie Notice Blocking Challenge

1.1 Annoyance & Privacy Risk

  • User Frustration: Nearly every modern site shows a cookie banner upon first visit. These banners often obscure content or interrupt reading flow. ⚠️
  • Ineffective Tracking Controls: Even if you click “Reject All,” many consent frameworks still load tracking scripts or network pixels behind the scenes, defeating their own purpose.

1.2 Why Generic Rules Fall Short

  • Wide Implementation Variability: Cookie notices can be implemented via JavaScript popups, HTML modals, fixed footers, or CSS overlays. Some are static banners; others are dynamically generated based on user location.
  • Multilingual & Regional Differences: A site viewed from France might show a French-language notice; the same site from Japan might show a Japanese banner. Generic rules often miss non-English variations.
  • Risk of Site Breakage: Overbroad filters that hide entire <div> containers can accidentally remove essential navigation elements, forms, or dynamic page sections. Maintenance becomes a constant battle against false positives and evolving HTML structures.

As a result, adblock list maintainers tend to limit cookie-banner rules to a few general selectors, but those rules frequently cause layout issues or fail to catch localized notices. iBrowe’s Cookiecrumbler addresses this by combining automated detection with human review, ensuring precise targeting without collateral damage.


⚙️ 2. Introducing Cookiecrumbler

2.1 What Is Cookiecrumbler?

🍪 Cookiecrumbler is an open-source framework that:

  1. Crawls Popular Websites: Uses region-specific Tranco lists to gather top domains per country.
  2. Loads Pages via a Headless Browser: Puppeteer instances emulate traffic from target regions (e.g., Europe, Asia, North America).
  3. Detects Candidate Elements: Scrapes all potential banner DOM nodes (popups, overlays, modals, footers).
  4. LLM Classification & Rule Suggestion: Invokes a lightweight open-source LLM to classify each snippet as a cookie notice or not, and—if confirmed—suggests a minimal hiding rule (CSS selector or JavaScript-based).
  5. Publishes Results to GitHub: Preferred blocking rules (and any false positives) land in a public repo as individual GitHub issues for community triage.

By automating initial detection with an LLM and offloading site-specific rule refinement to human maintainers, Cookiecrumbler scales across thousands of sites and dozens of languages.


🔍 3. How Cookiecrumbler Works in Detail

3.1 Building Region-Aware Site Lists

  • Tranco List Customization: We generate multiple Tranco-based lists (e.g., top 10,000 sites) filtered by geographic region to ensure broad coverage of locally popular domains. 🌐
  • Region-Specific Proxies: Each Puppeteer session loads pages through a proxy endpoint located in the target country, causing the site to render the region-appropriate cookie banner.
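
As a rough illustration of the proxy setup, the sketch below turns a per-region configuration into Chromium launch switches. The `Vantage` type, the `launch_args` helper, and the proxy hostnames are hypothetical and not taken from Cookiecrumbler itself; `--proxy-server` and `--lang` are standard Chromium command-line switches, and a list like this is the kind of value one could pass as `args` when launching Puppeteer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vantage:
    """One crawl vantage point (all values here are illustrative)."""
    country: str  # ISO 3166-1 alpha-2 code
    proxy: str    # hypothetical proxy endpoint located in that country
    locale: str   # UI language the browser should advertise


# Hypothetical endpoints; real ones would come from deployment config.
VANTAGES = {
    "FR": Vantage("FR", "http://proxy-fr.example.net:8080", "fr-FR"),
    "JP": Vantage("JP", "http://proxy-jp.example.net:8080", "ja-JP"),
}


def launch_args(country: str) -> list[str]:
    """Build Chromium switches that route traffic through the region's proxy."""
    v = VANTAGES[country]
    return [f"--proxy-server={v.proxy}", f"--lang={v.locale}"]


print(launch_args("FR"))
```

Keeping the region data in one table means adding a new country is a configuration change rather than a code change.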

3.2 Automated Crawling Pipeline

  1. CI Trigger: A scheduled cron job in our CI environment reads the region-specific lists and enqueues each domain for Cookiecrumbler.
  2. Headless Rendering: Puppeteer spawns a headless Chromium instance, sets geolocation via proxy, and waits for network idle to ensure all scripts (including consent frameworks) have loaded.
  3. Candidate Extraction:
    • We scan the DOM for overlay containers (e.g., div[id*="cookie"], section[class*="consent"], footer[class*="banner"]).
    • We record screenshot thumbnails and inner text of each candidate node for human review.
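
The candidate scan can be approximated offline with Python's standard `html.parser`. This is only a sketch under stated assumptions: the real crawler queries the live DOM through Puppeteer, and the matching here is looser than the example selectors (any of the three tags, any of the three keywords, in either `id` or `class`, case-insensitively). `CandidateFinder` and `find_candidates` are illustrative names, and screenshot and inner-text capture are omitted.

```python
from html.parser import HTMLParser

# Loosely mirrors the example selectors: tag plus a substring match on id/class.
KEYWORDS = ("cookie", "consent", "banner")
TAGS = {"div", "section", "footer"}


class CandidateFinder(HTMLParser):
    """Collects elements whose id or class hints at a consent banner."""

    def __init__(self) -> None:
        super().__init__()
        self.candidates: list[dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag not in TAGS:
            return
        # Attribute values can be None (e.g. <div hidden>), so normalize them.
        attr = {k: (v or "") for k, v in attrs}
        ident, classes = attr.get("id", ""), attr.get("class", "")
        haystack = f"{ident} {classes}".lower()
        if any(word in haystack for word in KEYWORDS):
            self.candidates.append({"tag": tag, "id": ident, "class": classes})


def find_candidates(html: str) -> list[dict[str, str]]:
    finder = CandidateFinder()
    finder.feed(html)
    return finder.candidates


page = '<div id="nav"></div><div id="cookie-banner" class="overlay">We use cookies</div>'
print(find_candidates(page))
```

A keyword match like this only nominates nodes; deciding whether a node really is a cookie notice is left to the classification step.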

3.3 LLM-Based Candidate Classification

  • Lightweight Model: We opted for a lightweight open-source LLM (e.g., Llama Tiny) that runs inside our own environment, fine-tuned on several hundred hand-labeled banner vs. non-banner examples.
  • Prompt Engineering: The prompt asks the model to answer: “Is this text a cookie consent notice? If yes, suggest a CSS selector rule (e.g., example.com###cookie-banner).”
  • Response Parsing:
    • If the LLM returns “Yes”, we extract the suggested selector and record it.
    • If it returns “No”, we skip that candidate node.
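
The prompt and parsing steps might look roughly like the following. The prompt wording is quoted from the bullet above, while the `Site:` and `Text:` fields, the function names, and the assumption that the reply begins with "Yes" or "No" are illustrative guesses rather than Cookiecrumbler's actual format. The model call itself is left out.

```python
import re
from typing import Optional

PROMPT_TEMPLATE = (
    "Is this text a cookie consent notice? If yes, suggest a CSS selector "
    "rule (e.g., example.com###cookie-banner).\n\nSite: {domain}\nText: {text}"
)

# A cosmetic rule looks like "<domain>##<selector>"; in this sketch the
# selector is assumed to end at the first whitespace character.
RULE_RE = re.compile(r"([\w.-]+)##(\S+)")


def build_prompt(domain: str, text: str) -> str:
    """Fill the prompt with the site and the candidate node's inner text."""
    return PROMPT_TEMPLATE.format(domain=domain, text=text)


def parse_reply(reply: str) -> Optional[str]:
    """Return the suggested rule from a "Yes" reply, or None to skip the node."""
    if not reply.strip().lower().startswith("yes"):
        return None
    match = RULE_RE.search(reply)
    return match.group(0) if match else None


print(parse_reply("Yes. Rule: example.com###cookie-banner"))
print(parse_reply("No, this is a newsletter signup form."))
```

A "Yes" reply that contains no parseable rule is also treated as a skip here; a real pipeline might instead flag such cases for manual review.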

3.4 Human-In-the-Loop Verification

  • GitHub Issue Creation: For each positive detection, Cookiecrumbler opens an issue pre-filled with:
    • Website URL
    • Region
    • Screenshot of the candidate banner
    • Suggested CSS selector or script rule
  • Maintainer Triage: Adblock list maintainers and community contributors review each issue:
    1. Confirm it is indeed a cookie banner.
    2. Refine or rewrite the suggested rule if necessary to avoid collateral blocking.
    3. Close the issue once a stable filter is merged into iBrowe’s default filters.

🌐 4. Handling Multilingual & Region-Specific Notices

Cookiecrumbler supports multiple vantage points simultaneously, ensuring we catch:

  • EU GDPR-style Notices (English, German, French, Spanish, etc.)
  • APAC-Region Variants (Japanese, Korean, Chinese, Thai, etc.)
  • Americas-Specific Notices (Portuguese-Brazil, Spanish-Latin America, English-North America)

By rotating proxies and user-agent strings, we simulate real-user conditions. The LLM classifier has been fine-tuned on a multilingual dataset, enabling robust detection even when the banner text is in a non-Latin script or uses custom fonts.
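
One simple way to rotate proxies and user-agent strings is round-robin assignment, sketched below. The `assign_vantages` helper and the placeholder values are hypothetical, and the actual rotation policy is not described in this post.

```python
from itertools import cycle
from typing import Iterable, Iterator


def assign_vantages(
    domains: Iterable[str], proxies: list[str], user_agents: list[str]
) -> Iterator[tuple[str, str, str]]:
    """Pair each domain with the next proxy and user agent, round-robin."""
    if not proxies or not user_agents:
        raise ValueError("need at least one proxy and one user agent")
    proxy_cycle, ua_cycle = cycle(proxies), cycle(user_agents)
    for domain in domains:
        yield domain, next(proxy_cycle), next(ua_cycle)


# Placeholder values for illustration only.
plan = list(assign_vantages(["a.example", "b.example", "c.example"],
                            ["proxy-1", "proxy-2"], ["ua-1"]))
print(plan)
```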


🔨 5. Publishing & Collaborating on GitHub

5.1 Public Repository and Issue Template

  • We maintain an open GitHub repo: github.com/iBrowe/cookiecrumbler-issues
  • Each issue uses a template containing:
    • Title: [Region] example.com – Cookie Notice Detected
    • Body:
      **Detected Banner**:  
      ![screenshot](link-to-thumbnail.png)
      
      **Region**: Europe (France)  
      **Candidate Selector**: `example.com###cookie-consent-overlay`  
      **LLM Confidence**: 92%
      
      **Reviewer Notes**:  
      - Confirmed that this is indeed a cookie banner.  
      - Adjusted CSS to `.cookie-popup` to avoid blocking a promotional banner on the homepage.  
      
  • Contributors can comment, suggest alternative selectors, or close the issue if the site no longer has a banner.
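
The template above could be filled in programmatically along these lines. The `Detection` type and its field names are assumptions made for this sketch, and the step that actually files the issue on GitHub is omitted.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    domain: str
    region: str
    screenshot_url: str
    rule: str
    confidence: float  # 0.0 to 1.0, as reported by the classifier


def issue_title(d: Detection) -> str:
    return f"[{d.region}] {d.domain} – Cookie Notice Detected"


def issue_body(d: Detection) -> str:
    """Render the Markdown body; reviewer notes are left for maintainers."""
    lines = [
        "**Detected Banner**:",
        f"![screenshot]({d.screenshot_url})",
        "",
        f"**Region**: {d.region}",
        f"**Candidate Selector**: `{d.rule}`",
        f"**LLM Confidence**: {d.confidence:.0%}",
        "",
        "**Reviewer Notes**:",
    ]
    return "\n".join(lines)


example = Detection("example.com", "Europe (France)", "link-to-thumbnail.png",
                    "example.com###cookie-consent-overlay", 0.92)
print(issue_title(example))
print(issue_body(example))
```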

5.2 Merging into iBrowe Default Filters

  • Once the maintainers approve a rule, they add the final version to iBrowe’s filters-cookie-consent.txt.
  • In the next release cycle, the updated filter file is packaged into iBrowe Desktop, Android, and eventually iOS.
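
Before a rule lands in the filter file, it can be sanity-checked and deduplicated. The sketch below assumes the file is a plain list of one rule per line and that rules use the common `domain##selector` cosmetic-filter form shown earlier; `parse_cosmetic_rule` and `merge_rule` are illustrative helpers, not part of iBrowe's tooling.

```python
def parse_cosmetic_rule(rule: str) -> tuple[str, str]:
    """Split "domain##selector" into its parts, rejecting malformed input."""
    domain, sep, selector = rule.partition("##")
    if not sep or not domain or not selector:
        raise ValueError(f"not a domain-scoped cosmetic rule: {rule!r}")
    return domain, selector


def merge_rule(filter_text: str, rule: str) -> str:
    """Append `rule` to the list unless an identical line already exists."""
    parse_cosmetic_rule(rule)  # validate before touching the list
    lines = filter_text.splitlines()
    if rule not in lines:
        lines.append(rule)
    return "\n".join(lines) + "\n"


current = "! Cookie consent filters\n"
updated = merge_rule(current, "example.com###cookie-consent-overlay")
print(updated)
```

Because the merge is idempotent, re-running it for an issue that was already handled leaves the file unchanged.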

🔒 6. Privacy-First Considerations

6.1 All Processing Happens Server-Side

  • No User Data Sent: Cookiecrumbler’s crawling and classification run entirely on iBrowe’s backend infrastructure. End users never upload browsing data or page contents.
  • Model Isolation: The LLM is contained within our gated environment; no external API calls leak site HTML or screenshots to third parties.

6.2 Human Verification Prevents Overblocking

  • By requiring human review of every suggested rule, we mitigate false positives that could hide critical page elements (e.g., login forms, navigation menus).
  • Maintainers ensure that blocking is as targeted as possible—ideally only the minimal CSS selector or JavaScript injection needed to hide the notice without harming usability.

📈 7. Impact and Metrics

Since integrating Cookiecrumbler:

  • Reduction in Breakage Reports: Fewer user bug reports related to broken layouts or missing buttons due to cookie-notice filters.
  • Increased Coverage: We now cover 97% of the top 10,000 EU-based sites (up from ~70% previously).
  • Multilingual Gains: Detected and addressed cookie banners on Thai-, Korean-, Russian-, and Arabic-language sites, which static filters previously under-served.

Early telemetry (privacy-preserving, no PII collected) indicates a 40% drop in “cookie-notice” filter rule failures (where a generic rule inadvertently hid other page elements).


🔮 8. Future Directions

8.1 Browser-Integrated Cookiecrumbler

  • We plan to build a lightweight detection engine directly into iBrowe, allowing end users to flag new cookie notices in real time.
  • A local on-device model could surface a “Block This Notice” prompt, send the candidate selector to the cloud, and retrieve a curated blocker if available.

8.2 Automated Rule Refinement

  • Enhance Cookiecrumbler to suggest JavaScript scriptlets (e.g., dynamically click “Accept All” or auto-hide popups), beyond static CSS selectors.
  • Leverage deep-learning-based vision models to detect banners where text detection fails (e.g., image-based consent notices).

8.3 Expanded Consent-Framework Footprinting

  • Extend Cookiecrumbler to detect other annoying or privacy-harming overlays, such as age verifications, geo-redirect modals, and intrusive auto-playing videos.

🎉 9. Conclusion

Cookiecrumbler marks a significant leap in how iBrowe automates cookie banner detection—combining region-aware crawling, open-source LLM classification, and community-driven rule refinement. By open sourcing the tool and publishing detection results as GitHub issues, we invite the broader adblocking community to collaborate in refining and expanding coverage. This approach ensures that iBrowe remains cookie-notice–free, multilingual, and regionally accurate—without the visual breakage that comes from blunt, generic filters.

Together, we’re making the Web more private and less cluttered, one crumb at a time. 🚀