Getting External Data¶
html_table parser¶
- Read the web page, create a list: which contain one or more dataframes that maps to each html table found
- Auto detect column header
Auto create index using number starting from 0
read_html(url) # return list of dataframe(s) that maps to web table(s) structure
In [227]:
df_list = pd.read_html('https://www.bloomberg.com/markets/currencies')
print ('Total Table(s) Found : ', len(df_list))
df = df_list[0]
print (df)
Total Table(s) Found : 1
Currency Value Change Net Change Time (EST) \
0 EUR-USD 1.2477 0.0026 +0.21% 6:20 AM
1 USD-JPY 106.6300 -0.3800 -0.36% 6:20 AM
2 GBP-USD 1.4050 0.0051 +0.36% 6:20 AM
3 AUD-USD 0.7930 0.0004 +0.05% 6:20 AM
4 USD-CAD 1.2490 -0.0003 -0.02% 6:21 AM
.. ... ... ... ... ...
6 EUR-JPY 133.0400 -0.2000 -0.15% 6:20 AM
7 EUR-GBP 0.8880 -0.0014 -0.16% 6:20 AM
8 USD-HKD 7.8219 -0.0005 -0.01% 6:20 AM
9 EUR-CHF 1.1550 -0.0022 -0.19% 6:20 AM
10 USD-KRW 1072.8200 -4.3800 -0.41% 1:29 AM
2 Day
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
.. ...
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
[11 rows x 6 columns]
Reading Local CSV File¶
Import from Local File (Basic)¶
In [228]:
goo = pd.read_csv('data/goog.csv')
In [229]:
goo.head()
Out[229]:
Date Open High Low \
0 12/19/2016 790.219971 797.659973 786.270020
1 12/20/2016 796.760010 798.650024 793.270020
2 12/21/2016 795.840027 796.676025 787.099976
3 12/22/2016 792.359985 793.320007 788.580017
4 12/23/2016 790.900024 792.739990 787.280029
Close Volume
0 794.200012 1225900
1 796.419983 925100
2 794.559998 1208700
3 791.260010 969100
4 789.909973 623400
In [230]:
goo.info
Out[230]:
<bound method DataFrame.info of Date Open High Low \
0 12/19/2016 790.219971 797.659973 786.270020
1 12/20/2016 796.760010 798.650024 793.270020
2 12/21/2016 795.840027 796.676025 787.099976
3 12/22/2016 792.359985 793.320007 788.580017
4 12/23/2016 790.900024 792.739990 787.280029
.. ... ... ... ...
56 3/13/2017 844.000000 848.684998 843.250000
57 3/14/2017 843.640015 847.239990 840.799988
58 3/15/2017 847.590027 848.630005 840.770020
59 3/16/2017 849.030029 850.849976 846.130005
60 3/17/2017 851.609985 853.400024 847.109985
Close Volume
0 794.200012 1225900
1 796.419983 925100
2 794.559998 1208700
3 791.260010 969100
4 789.909973 623400
.. ... ...
56 845.539978 1149500
57 845.619995 779900
58 847.200012 1379600
59 848.780029 970400
60 852.119995 1712300
[61 rows x 6 columns]>
Specify Index and Parsing During Import¶
In [231]:
goo = pd.read_csv('data/goog.csv',index_col='Date', parse_dates=['Date'])
In [232]:
goo.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 61 entries, 2016-12-19 to 2017-03-17 Data columns (total 5 columns): Open 61 non-null float64 High 61 non-null float64 Low 61 non-null float64 Close 61 non-null float64 Volume 61 non-null int64 dtypes: float64(4), int64(1) memory usage: 2.9 KB
In [233]:
goo.head()
Out[233]:
Open High Low \
Date
2016-12-19 790.219971 797.659973 786.270020
2016-12-20 796.760010 798.650024 793.270020
2016-12-21 795.840027 796.676025 787.099976
2016-12-22 792.359985 793.320007 788.580017
2016-12-23 790.900024 792.739990 787.280029
Close Volume
Date
2016-12-19 794.200012 1225900
2016-12-20 796.419983 925100
2016-12-21 794.559998 1208700
2016-12-22 791.260010 969100
2016-12-23 789.909973 623400
In [234]:
goo.index
Out[234]:
DatetimeIndex(['2016-12-19', '2016-12-20', '2016-12-21',
'2016-12-22', '2016-12-23', '2016-12-27',
'2016-12-28', '2016-12-29', '2016-12-30',
'2017-01-03', '2017-01-04', '2017-01-05',
'2017-01-06', '2017-01-09', '2017-01-10',
'2017-01-11', '2017-01-12', '2017-01-13',
'2017-01-17', '2017-01-18', '2017-01-19',
'2017-01-20', '2017-01-23', '2017-01-24',
'2017-01-25', '2017-01-26', '2017-01-27',
'2017-01-30', '2017-01-31', '2017-02-01',
'2017-02-02', '2017-02-03', '2017-02-06',
'2017-02-07', '2017-02-08', '2017-02-09',
'2017-02-10', '2017-02-13', '2017-02-14',
'2017-02-15', '2017-02-16', '2017-02-17',
'2017-02-21', '2017-02-22', '2017-02-23',
'2017-02-24', '2017-02-27', '2017-02-28',
'2017-03-01', '2017-03-02', '2017-03-03',
'2017-03-06', '2017-03-07', '2017-03-08',
'2017-03-09', '2017-03-10', '2017-03-13',
'2017-03-14', '2017-03-15', '2017-03-16',
'2017-03-17'],
dtype='datetime64[ns]', name='Date', freq=None)
In [235]:
goo.Close.max()
Out[235]:
852.1199949999999
In [236]:
goo.Close.plot()
Out[236]:
<matplotlib.axes._subplots.AxesSubplot at 0x2c318220978>
In [237]:
ny = pd.read_csv('nyc_311_data_subset-2.csv')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
Basic Validation
In [238]:
ny.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 971063 entries, 0 to 971062 Data columns (total 8 columns): Unique Key 971063 non-null int64 Created Date 971063 non-null object Closed Date 882944 non-null object Agency 971063 non-null object Incident Zip 911140 non-null object Borough 971063 non-null object Latitude 887284 non-null float64 Longitude 887284 non-null float64 dtypes: float64(2), int64(1), object(5) memory usage: 59.3+ MB
In [239]:
ny[:3]
Out[239]:
Unique Key Created Date Closed Date \
0 1 10/11/2016 11:53 10/11/2016 12:00
1 2 10/11/2016 11:36 10/11/2016 12:00
2 3 10/11/2016 11:36 10/11/2016 12:00
... Borough Latitude Longitude
0 ... QUEENS NaN NaN
1 ... QUEENS NaN NaN
2 ... QUEENS NaN NaN
[3 rows x 8 columns]
Import and Specify Index Column
In [240]:
ny = pd.read_csv('nyc_311_data_subset-2.csv',index_col='Unique Key')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
In [241]:
ny.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 971063 entries, 1 to 971063 Data columns (total 7 columns): Created Date 971063 non-null object Closed Date 882944 non-null object Agency 971063 non-null object Incident Zip 911140 non-null object Borough 971063 non-null object Latitude 887284 non-null float64 Longitude 887284 non-null float64 dtypes: float64(2), object(5) memory usage: 59.3+ MB
In [242]:
ny[:3]
Out[242]:
Created Date Closed Date Agency \
Unique Key
1 10/11/2016 11:53 10/11/2016 12:00 DSNY
2 10/11/2016 11:36 10/11/2016 12:00 DSNY
3 10/11/2016 11:36 10/11/2016 12:00 DSNY
Incident Zip Borough Latitude Longitude
Unique Key
1 NaN QUEENS NaN NaN
2 NaN QUEENS NaN NaN
3 NaN QUEENS NaN NaN