Reading a CSV file directly from a URL into Pandas is a common task, especially when dealing with web data.
However, sometimes the data you need requires authentication to access. Fortunately, Python and Pandas provide straightforward methods to handle this scenario. Let's explore how to read a CSV from a URL with authentication using Pandas.
Authentication with requests
If the URL requires authentication, you'll need to provide credentials to access it. This typically involves passing a username and password or an access token. The Python requests library supports various authentication methods, including HTTP basic authentication and token-based authentication.
For example, if the URL requires basic authentication, you can use the requests library to pass the credentials:
import requests
import pandas as pd
from io import StringIO
from requests.auth import HTTPBasicAuth
url = 'https://example.com/items/data.json'
user = 'username'
password = 'password'
data = requests.get(url, auth=HTTPBasicAuth(user, password))
df = pd.read_json(StringIO(data.text), lines=True)
df
The same approach works with the read_csv method.
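As a minimal sketch of the read_csv variant, assuming the endpoint returns plain CSV text: the helper name read_csv_with_basic_auth and the timeout value are illustrative choices, not part of the original example. It also calls raise_for_status() so that a failed login raises an error instead of an error page being parsed as data.

```python
from io import StringIO

import pandas as pd
import requests
from requests.auth import HTTPBasicAuth


def read_csv_with_basic_auth(url, user, password, timeout=30):
    """Download a CSV with HTTP basic auth and parse it into a DataFrame."""
    response = requests.get(
        url, auth=HTTPBasicAuth(user, password), timeout=timeout
    )
    # Raise on 4xx/5xx responses (for example 401 Unauthorized).
    response.raise_for_status()
    # Wrap the response body so pandas can read it like a file.
    return pd.read_csv(StringIO(response.text))


# Usage (placeholder URL and credentials):
# df = read_csv_with_basic_auth(
#     "https://example.com/items/data.csv", "username", "password"
# )
```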
Authentication with custom headers
Pandas offers the storage_options parameter for methods such as:
read_csv
read_json
For HTTP(S) URLs, Pandas forwards the storage_options key-value pairs as request headers. We can therefore pass authentication information to Pandas by building an Authorization header such as: {'Authorization': 'Basic xxxx'}
read_json + storage_options
The following example shows how this can be done:
import pandas as pd
from base64 import b64encode
def basic_auth(username, password):
    # Encode "username:password" as Base64, as HTTP basic authentication expects
    token = b64encode(f"{username}:{password}".encode('utf-8')).decode("ascii")
    return f'Basic {token}'
username = "username"
password = "password"
headers = {'Authorization': basic_auth(username, password)}
headers
df = pd.read_json(
    "https://example.com/items/data.json",
    storage_options=headers, lines=True
)
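To make the header format concrete, the sketch below repeats the basic_auth helper and prints what it produces for the placeholder credentials. It then decodes the token again, which illustrates that the credentials are only encoded, not encrypted, so such requests should go over HTTPS.

```python
from base64 import b64decode, b64encode


def basic_auth(username, password):
    """Build the value of an HTTP basic Authorization header."""
    token = b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


header_value = basic_auth("username", "password")
print(header_value)  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=

# Split off the scheme and decode the token to recover the credentials.
scheme, token = header_value.split(" ", 1)
print(b64decode(token).decode("utf-8"))  # username:password
```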
Read a remote CSV file
Below is a shorter version for the read_csv method:
import pandas as pd
from base64 import b64encode
# Decode the Base64 token to str so the header value is plain text
token = b64encode(b'username:password').decode('ascii')
df = pd.read_csv(
    'https://example.com/items/data.csv',
    storage_options={'Authorization': f'Basic {token}'},
)
df
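Token-based authentication can be handled the same way. The sketch below assumes the server expects a bearer token in the Authorization header, which is a common convention but depends on the API. The names bearer_headers and read_csv_with_token are hypothetical helpers introduced only for illustration.

```python
import pandas as pd


def bearer_headers(token):
    """Return request headers carrying a bearer token (assumed scheme)."""
    return {"Authorization": f"Bearer {token}"}


def read_csv_with_token(url, token):
    """Read a remote CSV, sending the token as an HTTP header."""
    # For HTTP(S) URLs, pandas forwards storage_options as request headers.
    return pd.read_csv(url, storage_options=bearer_headers(token))


# Usage (placeholder URL and token):
# df = read_csv_with_token("https://example.com/items/data.csv", "my-token")
```

Keeping the header construction in its own function makes it easy to swap in a different scheme if the API documents one.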
Read files from Amazon S3 buckets
Public bucket
To read data from public Amazon S3 buckets, we first need to install the s3fs library:
pip install s3fs
Then we can use:
import pandas as pd
# anon=True requests the public object without AWS credentials
pd.read_csv(
    "s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013"
    "-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
    storage_options={"anon": True}
)
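Private buckets
Alternatively, we can use the Pandas read_csv method with AWS credentials to read data from a private bucket. When credentials are already configured (for example through environment variables or ~/.aws/credentials), no extra arguments are needed: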
df = pd.read_csv("s3://my-private-bucket/data.csv")
We can also pass the AWS access key and secret explicitly through storage_options:
df = pd.read_csv(
    "s3://my-private-bucket/data.csv",
    storage_options={"key": "AKIAIOSFODNN7EXAMPLE", "secret": "SECRET"},
)
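Hard-coded keys are easy to leak through version control. As a hedged alternative, the sketch below builds the same storage_options dictionary from the standard AWS environment variables. The helper name s3_storage_options_from_env is hypothetical, and the optional token entry is only relevant for temporary credentials.

```python
import os


def s3_storage_options_from_env(environ=os.environ):
    """Build s3fs storage_options from the standard AWS environment variables."""
    options = {
        "key": environ["AWS_ACCESS_KEY_ID"],
        "secret": environ["AWS_SECRET_ACCESS_KEY"],
    }
    # Temporary credentials also carry a session token.
    session_token = environ.get("AWS_SESSION_TOKEN")
    if session_token:
        options["token"] = session_token
    return options


# Usage (needs pandas and s3fs, plus a bucket you can access):
# import pandas as pd
# df = pd.read_csv(
#     "s3://my-private-bucket/data.csv",
#     storage_options=s3_storage_options_from_env(),
# )
```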
Or we can load them from a named profile using the boto3 library:
import os
import pandas as pd
import boto3
session = boto3.Session(profile_name="test")
# Resolve the profile's credentials once and expose them to s3fs via the environment
credentials = session.get_credentials()
os.environ['AWS_ACCESS_KEY_ID'] = credentials.access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = credentials.secret_key
df = pd.read_csv("s3://xxxx.csv")
Conclusion
Reading a CSV from a URL with authentication in Pandas is a straightforward process.
By following the steps outlined above, you can access data hosted on the web securely and leverage the powerful data manipulation capabilities of Pandas for your analysis.
Resources
You can learn more about reading remote files with Pandas here: