Merged
14 changes: 7 additions & 7 deletions webscraping-and-apis.ipynb
@@ -487,9 +487,9 @@
"\n",
"Often there are times when you don't actually want to scrape an entire webpage and all you want is the data from a *table* within the page. Fortunately, there is an easy way to scrape individual tables using the **pandas** package.\n",
"\n",
"We will read data from the first table on 'https://simple.wikipedia.org/wiki/FIFA_World_Cup' using **pandas**. The function we'll use is `read_html()`, which returns a list of data frames of all the tables it finds when you pass it a URL. If you want to filter the list of tables, use the `match=` keyword argument with text that only appears in the table(s) you're interested in.\n",
"We will read data from a table on 'https://webscraper.io/test-sites/tables' using **pandas**. The function we'll use is `read_html()`, which returns a list of data frames of all the tables it finds when you pass it a URL. If you want to filter the list of tables, use the `match=` keyword argument with text that only appears in the table(s) you're interested in.\n",
"\n",
"The example below shows how this works; looking at the website, we can see that the table we're interested in (of past world cup results), has a 'fourth place' column while other tables on the page do not. Therefore we run:"
"The example below shows how this works; looking at the website, we can see that the table we're interested in, has a 'First Name' column. Therefore we run:"
]
},
{
@@ -499,10 +499,8 @@
"metadata": {},
"outputs": [],
"source": [
"df_list = pd.read_html(\n",
" \"https://simple.wikipedia.org/wiki/FIFA_World_Cup\", match=\"Sweden\"\n",
")\n",
"# Retrieve first and only entry from list of data frames\n",
"df_list = pd.read_html(\"https://webscraper.io/test-sites/tables\", match=\"First Name\")\n",
"# Retrieve first entry from list of data frames\n",
"df = df_list[0]\n",
"df.head()"
]
@@ -513,7 +511,9 @@
"id": "31e49317",
"metadata": {},
"source": [
"This gives us the table neatly loaded into a **pandas** data frame ready for further use."
"This gives us the table neatly loaded into a **pandas** data frame ready for further use.\n",
"\n",
"If you get a '403' error, it means that the website has blocked **pandas** because it can see that you are engaged in web scraping. This is because some people web scrape irresponsibly, or because websites have provided other, preferred ways for you to obtain the data, eg via a download of the whole thing (think Wikipedia) or through an API. (If you really need to, [you can often get around the 403 error](https://stackoverflow.com/questions/43590153/http-error-403-forbidden-when-reading-html) though.)"
]
}
],
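The 403 workaround linked in the new text usually amounts to fetching the page yourself with a browser-like `User-Agent` header and then handing the HTML text to `read_html()`. A hedged sketch, with the URL and header value as illustrative assumptions (the fetch itself is left commented out so the sketch runs offline):

```python
import urllib.request

# Hypothetical target; swap in the page that returned the 403
url = "https://webscraper.io/test-sites/tables"

# Send a browser-like User-Agent instead of Python's default one,
# which some servers reject out of hand
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# To actually fetch and parse, uncomment:
# import pandas as pd
# from io import StringIO
# html = urllib.request.urlopen(req).read().decode("utf-8")
# df = pd.read_html(StringIO(html), match="First Name")[0]
```

Note that a 403 is often the site asking you not to scrape; prefer an official download or API where one exists.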