diff --git a/webscraping-and-apis.ipynb b/webscraping-and-apis.ipynb
index c5cae45..c84214e 100644
--- a/webscraping-and-apis.ipynb
+++ b/webscraping-and-apis.ipynb
@@ -487,9 +487,9 @@
     "\n",
     "Often there are times when you don't actually want to scrape an entire webpage and all you want is the data from a *table* within the page. Fortunately, there is an easy way to scrape individual tables using the **pandas** package.\n",
     "\n",
-    "We will read data from the first table on 'https://simple.wikipedia.org/wiki/FIFA_World_Cup' using **pandas**. The function we'll use is `read_html()`, which returns a list of data frames of all the tables it finds when you pass it a URL. If you want to filter the list of tables, use the `match=` keyword argument with text that only appears in the table(s) you're interested in.\n",
+    "We will read data from a table on 'https://webscraper.io/test-sites/tables' using **pandas**. The function we'll use is `read_html()`, which returns a list of data frames of all the tables it finds when you pass it a URL. If you want to filter the list of tables, use the `match=` keyword argument with text that only appears in the table(s) you're interested in.\n",
     "\n",
-    "The example below shows how this works; looking at the website, we can see that the table we're interested in (of past world cup results), has a 'fourth place' column while other tables on the page do not. Therefore we run:"
+    "The example below shows how this works; looking at the website, we can see that the table we're interested in has a 'First Name' column. Therefore we run:"
    ]
   },
   {
@@ -499,10 +499,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "df_list = pd.read_html(\n",
-    " \"https://simple.wikipedia.org/wiki/FIFA_World_Cup\", match=\"Sweden\"\n",
-    ")\n",
-    "# Retrieve first and only entry from list of data frames\n",
+    "df_list = pd.read_html(\"https://webscraper.io/test-sites/tables\", match=\"First Name\")\n",
+    "# Retrieve first entry from list of data frames\n",
     "df = df_list[0]\n",
     "df.head()"
    ]
@@ -513,7 +511,9 @@
    "id": "31e49317",
    "metadata": {},
    "source": [
-    "This gives us the table neatly loaded into a **pandas** data frame ready for further use."
+    "This gives us the table neatly loaded into a **pandas** data frame ready for further use.\n",
+    "\n",
+    "If you get a '403' error, it means that the website has refused the request, typically because it can tell that the request comes from a script rather than a browser. Websites do this because some people web scrape irresponsibly, or because they have provided other, preferred ways for you to obtain the data, e.g. via a download of the whole thing (think Wikipedia) or through an API. (If you really need to, [you can often get around the 403 error](https://stackoverflow.com/questions/43590153/http-error-403-forbidden-when-reading-html) though.)"
    ]
   }
  ],
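The new 403 paragraph mentions that the error can often be worked around without showing how. A minimal sketch of one common approach follows, assuming the site only objects to the default user agent of a script: fetch the page yourself with an explicit `User-Agent` header, then hand the HTML to `read_html()`. The helper names `build_request` and `read_tables` and the `USER_AGENT` string are hypothetical, chosen for illustration; they are not part of the notebook or of **pandas**.

```python
from io import StringIO
from urllib.request import Request, urlopen

# Hypothetical identifier for illustration; any descriptive string could be used.
USER_AGENT = "Mozilla/5.0 (compatible; table-reader-example)"


def build_request(url: str) -> Request:
    """Wrap the URL in a request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def read_tables(url: str, match: str = ".+"):
    """Download the page ourselves, then let pandas parse its tables."""
    import pandas as pd  # imported here so build_request works without pandas

    with urlopen(build_request(url)) as response:
        html = response.read().decode("utf-8", errors="replace")
    # read_html accepts a file-like object, so wrap the HTML text in StringIO
    return pd.read_html(StringIO(html), match=match)
```

With this in place, `read_tables("https://webscraper.io/test-sites/tables", match="First Name")[0]` would play the role of `df_list[0]` in the cell above. Whether a given site accepts the request still depends on that site's own rules.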