https://github.com/shah0150/grab_data

python scrapping-python

Host: GitHub
URL: https://github.com/shah0150/grab_data
Owner: shah0150
Created: 2017-06-07T14:54:50.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2017-06-07T15:15:08.000Z (almost 9 years ago)
Last Synced: 2025-02-21T13:44:00.057Z (about 1 year ago)
Topics: python, scrapping-python
Language: Python
Size: 1000 Bytes
Stars: 0
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# Easy way to scrape data

Let’s say you are searching the web for some raw data you need for a project and you stumble across a webpage that shows it in a table.

But the bad news is that the data lives inside a web page and there’s no API that you can use to grab the raw data. So now you have to waste 30 minutes throwing together a crappy script to download and parse out the data. It’s not hard, but it’s a waste of time that you could spend on something useful. And somehow 30 minutes always ends up being 2 hours.

Luckily, there’s a super simple answer. The Pandas library has a built-in function, `read_html()`, that extracts tabular data from HTML pages (it needs an HTML parser such as lxml installed):

```python

import pandas as pd

tables = pd.read_html("http://apps.sandiego.gov/sdfiredispatch/")

print(tables[0])

```

It’s that simple! Pandas will find any significant HTML tables on the page and return them as a list of DataFrame objects, one per table.
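
To see the list-of-tables behaviour without depending on a live site, here is a minimal sketch that parses an inline HTML snippet instead of a URL. The markup and the name `SAMPLE_HTML` are made up for illustration, and it assumes an HTML parser such as lxml is installed alongside Pandas.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two small tables (not the San Diego dispatch page).
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Unit</th><th>Calls</th></tr>
  <tr><td>E1</td><td>3</td></tr>
  <tr><td>E2</td><td>5</td></tr>
</table>
<table>
  <tr><th>Station</th><th>City</th></tr>
  <tr><td>10</td><td>San Diego</td></tr>
</table>
</body></html>
"""

# read_html returns a list with one DataFrame per table it finds.
tables = pd.read_html(StringIO(SAMPLE_HTML))

print(len(tables))
print(tables[0])
```

Wrapping the string in `StringIO` makes Pandas treat it as a file-like object rather than a path or URL.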

To upgrade our program from toy to real, let’s tell Pandas that row 0 of the table has column headers and ask it to convert text-based dates into datetime values:

```python

import pandas as pd

calls_df, = pd.read_html("http://apps.sandiego.gov/sdfiredispatch/", header=0, parse_dates=["Call Date"])

print(calls_df)

```
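
The effect of those two arguments is easier to see side by side. The sketch below is only an illustration on made-up markup: the rows are invented, and only the `Call Date` column name comes from the example above.

```python
from io import StringIO

import pandas as pd

# Hypothetical single-table page whose first row holds the column names.
SAMPLE_HTML = """
<table>
  <tr><td>Call Date</td><td>Call Type</td></tr>
  <tr><td>2017-06-07 09:15:00</td><td>Medical</td></tr>
  <tr><td>2017-06-07 10:30:00</td><td>Fire</td></tr>
</table>
"""

# Without header=0 the first row is treated as data and columns are numbered.
raw_df, = pd.read_html(StringIO(SAMPLE_HTML))

# With header=0 the first row supplies the column names, and parse_dates
# converts the named column from text to datetimes.
calls_df, = pd.read_html(
    StringIO(SAMPLE_HTML), header=0, parse_dates=["Call Date"]
)

print(raw_df.columns.tolist())
print(calls_df.dtypes)
```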

And now that the data lives in a DataFrame, the world is yours. Wish the data was available as JSON records? That’s just one more line of code!

```python

import pandas as pd

calls_df, = pd.read_html("http://apps.sandiego.gov/sdfiredispatch/", header=0, parse_dates=["Call Date"])

print(calls_df.to_json(orient="records", date_format="iso"))

```
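
As a rough illustration of what `orient="records"` and `date_format="iso"` produce, the sketch below builds a small DataFrame by hand (the rows are invented stand-ins for the scraped table) and reads the JSON back with the standard library.

```python
import json

import pandas as pd

# Hypothetical rows standing in for the scraped table.
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(
            ["2017-06-07 09:15:00", "2017-06-07 10:30:00"]
        ),
        "Call Type": ["Medical", "Fire"],
    }
)

# orient="records" emits a JSON array with one object per row;
# date_format="iso" writes datetimes as ISO 8601 strings, not epoch numbers.
payload = calls_df.to_json(orient="records", date_format="iso")

# Parse the string back to confirm its shape.
records = json.loads(payload)
print(records[0]["Call Type"])
print(records[0]["Call Date"])
```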

You can even save the data right to a CSV or XLS file:

```python

import pandas as pd

calls_df, = pd.read_html("http://apps.sandiego.gov/sdfiredispatch/", header=0, parse_dates=["Call Date"])

calls_df.to_csv("calls.csv", index=False)

```
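
One thing to keep in mind is that CSV stores dates as plain text, so they need to be parsed again when the file is read back. The sketch below shows a round trip under that assumption, using invented rows and an in-memory buffer in place of `calls.csv`.

```python
from io import StringIO

import pandas as pd

# Hypothetical rows standing in for the scraped table.
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(
            ["2017-06-07 09:15:00", "2017-06-07 10:30:00"]
        ),
        "Call Type": ["Medical", "Fire"],
    }
)

# Writing to an in-memory buffer keeps the sketch self-contained.
# index=False leaves out the row-number column.
buffer = StringIO()
calls_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Dates come back as text unless read_csv is told which columns to parse.
restored = pd.read_csv(StringIO(csv_text), parse_dates=["Call Date"])
print(restored.dtypes)
```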

None of this is rocket science or anything, but I use it so often that I thought it was worth sharing. Have fun!

# A question you might have

`calls_df, = pd.read_html(…)`

What is the purpose of the comma after the variable name?

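--> This is tuple unpacking (more generally, sequence unpacking). `pd.read_html()` returns a list of DataFrames, and the trailing comma unpacks a one-element list straight into `calls_df` without an extra line of code to index into it. If the page yields more than one table, the assignment raises a `ValueError`.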
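
The same unpacking works on any sequence, so it can be tried without Pandas. In this small sketch a plain list of strings stands in for the list of DataFrames, and all the variable names are made up for illustration.

```python
# A plain list stands in for the list of DataFrames that read_html returns.
one_table = ["calls"]

# The trailing comma unpacks a one-element sequence into a single name.
calls, = one_table

# Two other ways to get at the single item.
[calls_alt] = one_table
calls_idx = one_table[0]

# Unlike indexing, unpacking fails loudly if there is more than one item.
two_tables = ["calls", "stations"]
try:
    only, = two_tables
except ValueError as exc:
    error = str(exc)
    print(error)
```

Indexing with `[0]` would silently ignore any extra tables, whereas the comma form doubles as a check that exactly one came back.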
