Getting External Data

HTML Table Parser

  • Reads the web page and returns a list containing one DataFrame per HTML table found
  • Auto-detects column headers
  • Auto-creates a default integer index starting from 0

    read_html(url)  # returns a list of DataFrame(s), one per HTML table on the page
    

In [227]:

df_list = pd.read_html('https://www.bloomberg.com/markets/currencies')
print ('Total Table(s) Found : ', len(df_list))
df = df_list[0]
print (df)
Total Table(s) Found :  1
   Currency      Value  Change Net Change Time (EST)  \
0   EUR-USD     1.2477  0.0026     +0.21%    6:20 AM   
1   USD-JPY   106.6300 -0.3800     -0.36%    6:20 AM   
2   GBP-USD     1.4050  0.0051     +0.36%    6:20 AM   
3   AUD-USD     0.7930  0.0004     +0.05%    6:20 AM   
4   USD-CAD     1.2490 -0.0003     -0.02%    6:21 AM   
..      ...        ...     ...        ...        ...   
6   EUR-JPY   133.0400 -0.2000     -0.15%    6:20 AM   
7   EUR-GBP     0.8880 -0.0014     -0.16%    6:20 AM   
8   USD-HKD     7.8219 -0.0005     -0.01%    6:20 AM   
9   EUR-CHF     1.1550 -0.0022     -0.19%    6:20 AM   
10  USD-KRW  1072.8200 -4.3800     -0.41%    1:29 AM   

    2 Day  
0     NaN  
1     NaN  
2     NaN  
3     NaN  
4     NaN  
..    ...  
6     NaN  
7     NaN  
8     NaN  
9     NaN  
10    NaN  

[11 rows x 6 columns]
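`read_html` also accepts a file-like object or raw HTML, which makes it easy to try without network access (it needs `lxml` or `html5lib` installed). A minimal sketch with a hand-written table mimicking the currency page's structure:

```python
from io import StringIO

import pandas as pd

# A tiny HTML table; read_html treats the <th> row as the column header.
html = """
<table>
  <tr><th>Currency</th><th>Value</th></tr>
  <tr><td>EUR-USD</td><td>1.2477</td></tr>
  <tr><td>USD-JPY</td><td>106.63</td></tr>
</table>
"""

dfs = pd.read_html(StringIO(html))   # one DataFrame per <table> element
print('Total Table(s) Found : ', len(dfs))
print(dfs[0])
```

The live Bloomberg page works the same way; the only difference is that the URL is fetched first.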

Reading Local CSV File

Import from Local File (Basic)

In [228]:

goo = pd.read_csv('data/goog.csv')

In [229]:

goo.head()

Out[229]:

         Date        Open        High         Low  \
0  12/19/2016  790.219971  797.659973  786.270020   
1  12/20/2016  796.760010  798.650024  793.270020   
2  12/21/2016  795.840027  796.676025  787.099976   
3  12/22/2016  792.359985  793.320007  788.580017   
4  12/23/2016  790.900024  792.739990  787.280029   

        Close   Volume  
0  794.200012  1225900  
1  796.419983   925100  
2  794.559998  1208700  
3  791.260010   969100  
4  789.909973   623400  

In [230]:

goo.info  # note: without parentheses this returns the bound method itself; call goo.info() for the summary

Out[230]:

<bound method DataFrame.info of           Date        Open        High         Low  \
0   12/19/2016  790.219971  797.659973  786.270020   
1   12/20/2016  796.760010  798.650024  793.270020   
2   12/21/2016  795.840027  796.676025  787.099976   
3   12/22/2016  792.359985  793.320007  788.580017   
4   12/23/2016  790.900024  792.739990  787.280029   
..         ...         ...         ...         ...   
56   3/13/2017  844.000000  848.684998  843.250000   
57   3/14/2017  843.640015  847.239990  840.799988   
58   3/15/2017  847.590027  848.630005  840.770020   
59   3/16/2017  849.030029  850.849976  846.130005   
60   3/17/2017  851.609985  853.400024  847.109985   

         Close   Volume  
0   794.200012  1225900  
1   796.419983   925100  
2   794.559998  1208700  
3   791.260010   969100  
4   789.909973   623400  
..         ...      ...  
56  845.539978  1149500  
57  845.619995   779900  
58  847.200012  1379600  
59  848.780029   970400  
60  852.119995  1712300  

[61 rows x 6 columns]>

Specify Index and Date Parsing During Import

In [231]:

goo = pd.read_csv('data/goog.csv',index_col='Date', parse_dates=['Date'])

In [232]:

goo.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 61 entries, 2016-12-19 to 2017-03-17
Data columns (total 5 columns):
Open      61 non-null float64
High      61 non-null float64
Low       61 non-null float64
Close     61 non-null float64
Volume    61 non-null int64
dtypes: float64(4), int64(1)
memory usage: 2.9 KB

In [233]:

goo.head()

Out[233]:

                  Open        High         Low  \
Date                                             
2016-12-19  790.219971  797.659973  786.270020   
2016-12-20  796.760010  798.650024  793.270020   
2016-12-21  795.840027  796.676025  787.099976   
2016-12-22  792.359985  793.320007  788.580017   
2016-12-23  790.900024  792.739990  787.280029   

                 Close   Volume  
Date                             
2016-12-19  794.200012  1225900  
2016-12-20  796.419983   925100  
2016-12-21  794.559998  1208700  
2016-12-22  791.260010   969100  
2016-12-23  789.909973   623400  

In [234]:

goo.index

Out[234]:

DatetimeIndex(['2016-12-19', '2016-12-20', '2016-12-21',
               '2016-12-22', '2016-12-23', '2016-12-27',
               '2016-12-28', '2016-12-29', '2016-12-30',
               '2017-01-03', '2017-01-04', '2017-01-05',
               '2017-01-06', '2017-01-09', '2017-01-10',
               '2017-01-11', '2017-01-12', '2017-01-13',
               '2017-01-17', '2017-01-18', '2017-01-19',
               '2017-01-20', '2017-01-23', '2017-01-24',
               '2017-01-25', '2017-01-26', '2017-01-27',
               '2017-01-30', '2017-01-31', '2017-02-01',
               '2017-02-02', '2017-02-03', '2017-02-06',
               '2017-02-07', '2017-02-08', '2017-02-09',
               '2017-02-10', '2017-02-13', '2017-02-14',
               '2017-02-15', '2017-02-16', '2017-02-17',
               '2017-02-21', '2017-02-22', '2017-02-23',
               '2017-02-24', '2017-02-27', '2017-02-28',
               '2017-03-01', '2017-03-02', '2017-03-03',
               '2017-03-06', '2017-03-07', '2017-03-08',
               '2017-03-09', '2017-03-10', '2017-03-13',
               '2017-03-14', '2017-03-15', '2017-03-16',
               '2017-03-17'],
              dtype='datetime64[ns]', name='Date', freq=None)
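The same `index_col`/`parse_dates` combination can be verified on an in-memory CSV. The column names below mirror `goog.csv`, but the two rows are made up:

```python
from io import StringIO

import pandas as pd

csv_text = "Date,Close\n12/19/2016,794.20\n12/20/2016,796.42\n"

# parse_dates converts the strings to Timestamps; index_col promotes
# the parsed column to a DatetimeIndex.
goo_small = pd.read_csv(StringIO(csv_text), index_col='Date', parse_dates=['Date'])
print(type(goo_small.index).__name__)   # DatetimeIndex
```

A DatetimeIndex is what enables partial-string selection such as `goo_small.loc['2016-12']`.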

In [235]:

goo.Close.max()

Out[235]:

852.1199949999999

In [236]:

goo.Close.plot()

Out[236]:

<matplotlib.axes._subplots.AxesSubplot at 0x2c318220978>

In [237]:

ny = pd.read_csv('nyc_311_data_subset-2.csv')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
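The DtypeWarning above means pandas' chunked reader inferred different types for the same column (column 4, `Incident Zip`) in different chunks. Declaring the dtype up front silences it; a minimal sketch with synthetic data (the real file is too large to reproduce here):

```python
from io import StringIO

import pandas as pd

# 'Incident Zip' mixes numeric-looking and non-numeric values, which is
# what triggers mixed-type inference on a large file.
csv_text = "Unique Key,Incident Zip\n1,10001\n2,Unknown\n"

ny_fixed = pd.read_csv(StringIO(csv_text), dtype={'Incident Zip': str})
print(ny_fixed['Incident Zip'].tolist())   # ['10001', 'Unknown']
```

Alternatively, `low_memory=False` makes pandas inspect the whole file before choosing a type, at the cost of higher memory use.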

Basic Validation

In [238]:

ny.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 971063 entries, 0 to 971062
Data columns (total 8 columns):
Unique Key      971063 non-null int64
Created Date    971063 non-null object
Closed Date     882944 non-null object
Agency          971063 non-null object
Incident Zip    911140 non-null object
Borough         971063 non-null object
Latitude        887284 non-null float64
Longitude       887284 non-null float64
dtypes: float64(2), int64(1), object(5)
memory usage: 59.3+ MB

In [239]:

ny[:3]

Out[239]:

   Unique Key      Created Date       Closed Date  \
0           1  10/11/2016 11:53  10/11/2016 12:00   
1           2  10/11/2016 11:36  10/11/2016 12:00   
2           3  10/11/2016 11:36  10/11/2016 12:00   

     ...    Borough Latitude Longitude  
0    ...     QUEENS      NaN       NaN  
1    ...     QUEENS      NaN       NaN  
2    ...     QUEENS      NaN       NaN  

[3 rows x 8 columns]
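The non-null counts shown by `info()` can be turned into an explicit missing-value report with `isnull().sum()`; a sketch on made-up rows shaped like the 311 data:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Closed Date': ['10/11/2016 12:00', None, '10/12/2016 09:00'],
    'Latitude': [40.71, np.nan, np.nan],
})

missing = sample.isnull().sum()   # missing-value count per column
print(missing)
```

On the full dataset this would report, e.g., 971063 - 882944 = 88119 missing `Closed Date` values.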

Import and Specify Index Column

In [240]:

ny = pd.read_csv('nyc_311_data_subset-2.csv',index_col='Unique Key')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [241]:

ny.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 971063 entries, 1 to 971063
Data columns (total 7 columns):
Created Date    971063 non-null object
Closed Date     882944 non-null object
Agency          971063 non-null object
Incident Zip    911140 non-null object
Borough         971063 non-null object
Latitude        887284 non-null float64
Longitude       887284 non-null float64
dtypes: float64(2), object(5)
memory usage: 59.3+ MB

In [242]:

ny[:3]

Out[242]:

                Created Date       Closed Date Agency  \
Unique Key                                              
1           10/11/2016 11:53  10/11/2016 12:00   DSNY   
2           10/11/2016 11:36  10/11/2016 12:00   DSNY   
3           10/11/2016 11:36  10/11/2016 12:00   DSNY   

           Incident Zip Borough  Latitude  Longitude  
Unique Key                                            
1                   NaN  QUEENS       NaN        NaN  
2                   NaN  QUEENS       NaN        NaN  
3                   NaN  QUEENS       NaN        NaN  
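Since `Unique Key` is now the index, it is worth confirming that it really is unique before relying on label lookups; a minimal sketch (`ny_small` is a made-up two-row stand-in for the full dataset):

```python
import pandas as pd

ny_small = pd.DataFrame(
    {'Agency': ['DSNY', 'DSNY']},
    index=pd.Index([1, 2], name='Unique Key'),
)
print(ny_small.index.is_unique)   # True
```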
