user data crawling opens two windows, unable to control correct user browser #545

BZBY · 2024-11-07T02:57:21Z

BZBY
Nov 7, 2024

I've just tested the latest main branch on Windows 11 and Ubuntu and encountered an issue. Here's the test code I used:

python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
            headless=False,  # Set to False to see what is happening
            use_managed_browser=True,
            browser_type="chromium",
    ) as crawler:
        result = await crawler.arun(
            url="https://www.youtube.com/",
            magic=True
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

When running this code, two browser windows open up: one displays Chrome's login screen, and the other loads the URL I specified. All subsequent operations happen in the second browser window, but closing it also causes the first browser window to close. This suggests that the two windows are instances of the same browser. However, when I add user data as follows:

python
async with AsyncWebCrawler(
        headless=False,
        use_managed_browser=True,
        user_data_dir=r"C:\Users\BZBY\AppData\Local\Google\Chrome\User Data",
        browser_type="chromium",
) as crawler:

The issue becomes apparent. The first window is my real browser instance, but the second window lacks my user data—it only has bookmark information and doesn’t display the user profile icon in the top right corner of Chrome. This means that the second window cannot access sites I’m already logged into, so I have to log in again.

Ideally, I should be able to open a browser with my actual user profile or use a command like:


bash
start chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\Users\BZBY\AppData\Local\Google\Chrome\User Data"

This command allows me to open a browser that I can access directly using playwright.chromium.connect_over_cdp(cdp_url) to interact with my existing open browser instance.
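For reference, attaching to a Chrome instance started with the command above might look like the minimal sketch below. It assumes Playwright for Python is installed and that Chrome is listening on port 9222. The helper names cdp_endpoint and list_open_pages are made up for illustration, and this is plain Playwright rather than anything crawl4ai does for you.

```python
def cdp_endpoint(host: str = "localhost", port: int = 9222) -> str:
    """Build the HTTP endpoint Chrome exposes for --remote-debugging-port."""
    return f"http://{host}:{port}"


async def list_open_pages(endpoint: str) -> list[str]:
    """Attach to an already-running Chrome and return the URLs of its open tabs."""
    # Imported lazily so cdp_endpoint() works even without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        # Attach to the existing browser instead of launching a new one.
        browser = await p.chromium.connect_over_cdp(endpoint)
        # Contexts that already exist in the running browser are exposed
        # through browser.contexts, so the real profile's tabs show up here.
        return [page.url for context in browser.contexts for page in context.pages]
```

With Chrome already running, calling asyncio.run(list_open_pages(cdp_endpoint())) should list the tabs of the real profile, which is the behavior I would expect from the managed browser as well.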

mukulchaudhary · 2024-11-07T05:14:57Z

mukulchaudhary
Nov 7, 2024

I'm also facing something similar. After I log in to an application within the on_browser_created hook, the crawler opens a different window, but it seems to miss the browser context: it is unable to crawl the internal page and shows the login page again. It does work, however, when I open multiple pages in the same context after logging in on one of the pages within on_browser_created. Does this work for you, in case you tried it? Thanks.

0 replies

BZBY · 2024-11-07T06:54:05Z

BZBY
Nov 7, 2024
Author

> @mukulchaudhary: I'm also facing something similar. After I log in to an application within the on_browser_created hook, the crawler opens a different window, but it seems to miss the browser context: it is unable to crawl the internal page and shows the login page again. It does work, however, when I open multiple pages in the same context after logging in on one of the pages within on_browser_created. Does this work for you, in case you tried it? Thanks.

async with AsyncWebCrawler(
    headless=False,  # Set to False to see what is happening
    use_managed_browser=True,
    browser_type="chromium",
) as crawler:
    result = await crawler.arun(
        url="https://www.youtube.com/",  # -> breakpoint
    ...

I placed a breakpoint here, and it opened only one browser window. After logging in and resuming execution, it successfully scraped the content displayed after logging in. This is a potential solution.

0 replies

pttodv · 2024-11-07T07:59:14Z

pttodv
Nov 7, 2024

> @BZBY: I placed a breakpoint here [at the crawler.arun() call], and it opened only one browser window. After logging in and resuming execution, it successfully scraped the content displayed after logging in. This is a potential solution.

I think this is a different issue regarding hooks and authentication. However, I am currently facing exactly the same issue as @mukulchaudhary. There is no way to get the session_id of the browser created in on_browser_created (session IDs being the current method of maintaining sessions while navigating).

The documentation says to use the on_browser_created hook to log in to a website that requires authentication. However, a new browser context is created after AsyncWebCrawler().arun(), which defeats the purpose of the hook, since I can't use the authenticated browser.

Image shows two browser contexts after following the documentation's Hooks & Auth for AsyncWebCrawler with Google Sign in. The authenticated browser was ignored, and a new browser requiring another login shows.

Is this the expected behavior when using on_browser_created hooks or is this a bug? Using a breakpoint didn't work in my case.

0 replies

mukulchaudhary · 2024-11-07T08:37:37Z

mukulchaudhary
Nov 7, 2024

> @pttodv: I think this is a different issue regarding hooks and authentication. However, I am currently facing exactly the same issue as @mukulchaudhary. There is no way to get the session_id of the browser created in on_browser_created (session IDs being the current method of maintaining sessions while navigating).
>
> The documentation says to use the on_browser_created hook to log in to a website that requires authentication. However, a new browser context is created after AsyncWebCrawler().arun(), which defeats the purpose of the hook, since I can't use the authenticated browser.
>
> Image shows two browser contexts after following the documentation's Hooks & Auth for AsyncWebCrawler with Google Sign in. The authenticated browser was ignored, and a new browser requiring another login shows.
>
> Is this the expected behavior when using on_browser_created hooks or is this a bug? Using a breakpoint didn't work in my case.

Exactly. That's what is missing.

0 replies

mukulchaudhary · 2024-11-08T20:40:47Z

mukulchaudhary
Nov 8, 2024

@unclecode Please share your views/guidance on this when you get time. Thanks.

0 replies

BZBY · 2024-11-11T01:18:13Z

BZBY
Nov 11, 2024
Author

> @mukulchaudhary: @unclecode Please share your views/guidance on this when you get time. Thanks.

The new version seems to have resolved this issue. Please try installing the latest version from the main branch to see if the problem persists.

0 replies

pttodv · 2024-11-11T01:42:13Z

pttodv
Nov 11, 2024

> @BZBY: The new version seems to have resolved this issue. Please try installing the latest version from the main branch to see if the problem persists.

The issue is still there. The documentation states that I can pass my session_id as a parameter to AsyncPlaywrightCrawlerStrategy, and I was planning to use that ID to reach the authenticated browser. However, I cannot for the life of me find it in the code.

The closest thing I can find is sessions, and as far as I know a session_id can't be added to it.

0 replies

unclecode · 2024-11-11T12:02:50Z

unclecode
Nov 11, 2024
Maintainer

Hello everybody, I am here :)) let me go through this in detail.

First of all, @pttodv, there's no session ID there. Instead, you pass the session ID to the crawl() function, and we use the sessions dictionary to keep the sessions in memory. So there's nothing over there; you're searching in the wrong place.

@BZBY Regarding the managed browser setting, I recorded this video to make it clearer. First, make sure to set bypass_cache to True in the crawl function. Otherwise the crawler reads the result from the cache, so it only creates a browser instance without actually opening the page.

issue_236.mp4

I will show two different cases: when passing a user data directory and when not.

How did I create the folder?
I created the user data directory by first creating an empty folder, then opening Chrome from the terminal on a specific port with that folder as the user data directory (passing --remote-debugging-port=9222 --user-data-dir="ADDRESS"). After running the command, a new browser window popped up. I signed in to my Google account and checked YouTube and the other sites where I wanted to make sure my personal data was present. Then I closed the browser; all the data was now in the directory.
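The profile-creation steps above could be scripted roughly as follows. This is only a sketch: the chrome_path value depends on your OS and installation, and both function names are hypothetical. Only the two command-line flags come from the steps above.

```python
import subprocess
from pathlib import Path


def build_chrome_command(chrome_path: str, user_data_dir: str, port: int = 9222) -> list[str]:
    """Return the argv for launching Chrome with a dedicated profile directory."""
    return [
        chrome_path,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
    ]


def launch_profile_browser(chrome_path: str, user_data_dir: str, port: int = 9222) -> subprocess.Popen:
    """Create the (initially empty) profile folder and start Chrome on it."""
    Path(user_data_dir).mkdir(parents=True, exist_ok=True)
    # Passing a list (not a shell string) avoids quoting issues with paths
    # that contain spaces, such as "User Data".
    return subprocess.Popen(build_chrome_command(chrome_path, user_data_dir, port))
```

After signing in manually in the window that opens and closing it, the same directory can be passed as user_data_dir to AsyncWebCrawler.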

1/ With user data directory: In the first part of the video, I pass this folder and attempt to access the YouTube website, which opens perfectly. It logs in to YouTube with my account and extracts the data.

2/ Without user data directory: In the video's second part, I do this without passing the user data directory. You'll see it's a fresh profile that doesn't show my personal YouTube data. I can, however, set a breakpoint or use a hook to wait, go back to the browser, log in, and perform actions before letting the crawl continue. Either way, in this scenario a new user data directory is created and all the data is stored there.

I prefer the first option as it allows me to create multiple profiles for multiple purposes.

Now let me know if this helps. I don't face the issue of multiple browsers opening at the same time.

0 replies

mukulchaudhary · 2024-11-11T18:18:55Z

mukulchaudhary
Nov 11, 2024

It doesn't seem to work, at least in my case: I perform the login actions in the on_browser_created hook, set the hook on AsyncPlaywrightCrawlerStrategy, and use the strategy with AsyncWebCrawler to crawl an internal page, as per the documentation. It only takes me to the login page. @unclecode Please provide an example when you can.

0 replies

pttodv · 2024-11-11T18:41:20Z

pttodv
Nov 11, 2024

> @mukulchaudhary: It doesn't seem to work, at least in my case: I perform the login actions in the on_browser_created hook, set the hook on AsyncPlaywrightCrawlerStrategy, and use the strategy with AsyncWebCrawler to crawl an internal page, as per the documentation. It only takes me to the login page. @unclecode Please provide an example when you can.

To add, I used crawl() to get the session_id as @unclecode said. While that did let me reuse a browser instance for AsyncWebCrawler, I still can't use the browser that was opened in on_browser_created().

Is there a way to use the browser instance opened during on_browser_created()'s execution? Here's the code I am using.

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from playwright.async_api import Page, Browser

async def main():
    async def on_browser_created(browser: Browser):
        print("[HOOK] on_browser_created")
        # Example customization: set browser viewport size
        context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
        page = await context.new_page()

        # Example customization: logging in to a hypothetical website
        await page.goto('https://google.com/')

        #Commented out the code below since I only wanted to know if the browser instance above is used for the next part of the code.
        #await page.fill('input[name="username"]', 'testuser')
        #await page.fill('input[name="password"]', 'password123')
        #await page.click('button[type="submit"]')
        #await page.wait_for_selector('#welcome')

        # Add a custom cookie
        await context.add_cookies([{'name': 'test_cookie', 'value': 'cookie_value', 'url': 'https://example.com'}])

        print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")

        # Close the browser. Also tried commenting these out in hopes of the browser instance persisting
        await page.close()
        await context.close()

    crawler_strategy = AsyncPlaywrightCrawlerStrategy(headless=False, verbose=True)  
    crawler_strategy.set_hook('on_browser_created', on_browser_created)

    async with AsyncWebCrawler(headless=False, verbose=False, crawler_strategy=crawler_strategy, use_managed_browser=True) as crawler:  
        await crawler_strategy.crawl('https://facebook.com', session_id="Test")
        print(crawler.crawler_strategy.sessions)
        result = await crawler.arun(
            url="https://example.com",
            bypass_cache=True,
            session_id="Test",
            js_code="document.elementFromPoint(window.innerWidth / 2, window.innerHeight / 2).click();",
            delay_before_return_html=2.0
        )
        result1 = await crawler.arun(
            url="https://www.iana.org/domains",
            bypass_cache=True,
            session_id="Test",
            js_code="document.elementFromPoint(window.innerWidth / 2, window.innerHeight / 2).click();",
            delay_before_return_html=2.0
        )
asyncio.run(main())

0 replies

arlse · 2024-11-18T07:47:45Z

arlse
Nov 18, 2024

I used Claude to generate some code, and it worked for me. Maybe it can help:

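import asyncio
import json
import time

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from playwright.async_api import Browser

def create_browser_hook(strategy: AsyncPlaywrightCrawlerStrategy):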
    async def on_browser_created(browser: Browser):
        print("[HOOK] on_browser_created")
        session_id = "session_0"

        with open('cookies.json') as f:
            cookies = json.load(f)

        context = await browser.new_context(
            viewport={'width': 1920, 'height': 1080}
        )

        await context.add_cookies(cookies)

        page = await context.new_page()

        strategy.sessions[session_id] = (context, page, time.time())

        # await page.close()
        # await context.close()

    return on_browser_created


async def add_cookies_and_crawl():
    print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")
    
    crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True, headless=False)
    hook = create_browser_hook(crawler_strategy)
    crawler_strategy.set_hook('on_browser_created', hook)

    async with AsyncWebCrawler(
            verbose=True,
            crawler_strategy=crawler_strategy,
            use_manual_browser=True,
            browser_type="chromium",
            sleep_on_close= True,
            # user_data_dir= "./user_data_dir"
    ) as crawler:
        result = await crawler.arun(
            url="https://www.google.com",
            bypass_cache=True,
            # js_code=js_code,
            magic=True,
            session_id="session_0",
        )

        print(result.markdown)

asyncio.run(add_cookies_and_crawl())

0 replies

unclecode · 2024-11-20T11:01:16Z

unclecode
Nov 20, 2024
Maintainer

@pttodv Sorry for my delay in responding. I've been very busy preparing 0.3.74. I've been trying to reproduce your situation here, but I still haven't been able to. Could you try the new version? In the new version, use the managed browser (use_managed_browser=True) and pass a user_data_dir.

Session ID: when you don't pass a session_id to arun and set the headless option to False, you will see that it creates two tabs in the same browser. However, when you pass the same session ID, it reuses the same tab and preserves it. In none of these cases should you end up with two browsers.
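The tab behavior described above can be pictured with a small toy model. This is not crawl4ai's actual code: SessionRegistry, FakePage, and get_page are hypothetical names, and the only detail borrowed from this thread is that sessions are kept in an in-memory dictionary keyed by session_id.

```python
import itertools
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakePage:
    """Stand-in for a browser tab; a real crawler would hold a Playwright Page."""
    tab_id: int


@dataclass
class SessionRegistry:
    """Toy in-memory session store: session_id -> (page, creation time)."""
    sessions: dict = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def get_page(self, session_id: Optional[str] = None) -> FakePage:
        # Known session_id: hand back the same tab, preserving its state.
        if session_id is not None and session_id in self.sessions:
            return self.sessions[session_id][0]
        # Otherwise open a fresh tab (still inside the same browser).
        page = FakePage(tab_id=next(self._ids))
        if session_id is not None:
            self.sessions[session_id] = (page, time.time())
        return page


registry = SessionRegistry()
first = registry.get_page("Test")
second = registry.get_page("Test")  # same session_id -> same tab
anonymous = registry.get_page()     # no session_id -> new tab
print(first is second, anonymous is first)  # True False
```

In this model there is only ever one browser; what changes with session_id is whether a tab is reused or a new one is opened.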

And remember, if you don't pass any user directory (user profile directory), it will open one browser and one pop-up window asking you to choose your profile. If you click on a different profile, you may get two browsers. I don't know if this is the case you're facing or not. Please try and let me know.

0 replies
