Data Loader

Data Loader (payn.DataLoader.DataLoader)

Provides a unified abstraction layer for data ingestion. It decouples the downstream pipeline from specific file formats, automatically inferring the correct parsing strategy (CSV vs. Excel) based on file extensions.

  • Dynamic: The class accepts arbitrary keyword arguments (**kwargs), so format-specific parameters (such as sheet_name for Excel or delimiter for CSV) can be passed straight from the central configuration file without code changes.
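
As a rough illustration of the point above, the sketch below separates one configuration entry into the file path and the remaining reader options. The configuration layout and the helper name split_loader_config are assumptions made for this example; they are not part of payn.

```python
from typing import Any, Dict, Tuple


def split_loader_config(config: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: separate the path from the reader options."""
    options = dict(config)                 # copy so the caller's dict is untouched
    file_path = options.pop("file_path")   # raises KeyError if the key is missing
    return file_path, options


# Assumed config entries; every key except file_path is forwarded as **kwargs.
excel_config = {"file_path": "data/input.xlsx", "sheet_name": "Sheet1"}
csv_config = {"file_path": "data/input.csv", "delimiter": ";"}

excel_path, excel_kwargs = split_loader_config(excel_config)
csv_path, csv_kwargs = split_loader_config(csv_config)

# With payn installed, the results would then be used as:
#   DataLoader(excel_path, **excel_kwargs).load_data()
#   DataLoader(csv_path, **csv_kwargs).load_data()
```

Because the options reach pandas unchanged, a key that the selected reader does not accept (for example sheet_name with a CSV file) surfaces as a TypeError from pandas rather than being silently ignored.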

Class used for importing data from various file formats into Pandas DataFrames.

Acts as a wrapper around pandas I/O functions, providing a consistent interface for loading data regardless of the underlying file format (CSV or Excel).

Attributes:

Name       Type            Description
file_path  str             The absolute or relative path to the input file.
kwargs     Dict[str, Any]  Additional keyword arguments passed directly to the underlying pandas reading function (e.g., sheet_name, header).
file_type  str             The inferred file type ('excel' or 'csv').
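
To make the inference rule concrete, here is a standalone sketch of the same extension check written as a free function. The name infer_file_type is chosen for this example only; in the class the logic lives in the private method _infer_file_type.

```python
import os


def infer_file_type(file_path: str) -> str:
    """Standalone mirror of the extension check used by DataLoader."""
    _, ext = os.path.splitext(file_path)
    ext = ext.lower()  # the comparison is case-insensitive
    if ext in (".xls", ".xlsx"):
        return "excel"
    if ext == ".csv":
        return "csv"
    raise ValueError(f"Unsupported file extension: '{ext}'.")


print(infer_file_type("reports/Q1.XLSX"))  # excel
print(infer_file_type("export.csv"))       # csv

# Only the final suffix is inspected, so a compressed CSV is rejected.
try:
    infer_file_type("export.csv.gz")
except ValueError as exc:
    print(exc)
```

Note that the check looks only at the file name, not at the file contents, so a mislabelled file is detected only later, when pandas tries to parse it.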

Source code in payn\DataLoader\dataloader.py
# Imports used by the class below.
import os

import pandas as pd


class DataLoader:
    """
    Class used for importing data from various file formats into Pandas DataFrames.

    Acts as a wrapper around pandas I/O functions, providing a consistent interface
    for loading data regardless of the underlying file format (CSV or Excel).

    Attributes:
        file_path (str): The absolute or relative path to the input file.
        kwargs (Dict[str, Any]): Additional keyword arguments passed directly to
            the underlying pandas reading function (e.g., `sheet_name`, `header`).
        file_type (str): The inferred file type ('excel' or 'csv').
    """

    def __init__(self, file_path: str, **kwargs):
        """
        Initialize the DataLoader class.

        Args:
            file_path (str): Path to the input file.
            **kwargs (Dict[str, Any]): Additional arguments for pandas read functions.
                      Common examples include `sheet_name` for Excel or `sep` for CSV.

        Raises:
            ValueError: If the file extension is not supported.
        """
        self.file_path = file_path
        self.kwargs = kwargs
        self.file_type = self._infer_file_type()

    def _infer_file_type(self) -> str:
        """
        Infer file type based on the file extension.

        Returns:
            str: File type ('excel' or 'csv').

        Raises:
            ValueError: If the file extension is not among the supported types
                        (.xls, .xlsx, .csv).
        """
        _, ext = os.path.splitext(self.file_path)
        ext = ext.lower()

        if ext in ['.xls', '.xlsx']:
            return 'excel'
        elif ext == '.csv':
            return 'csv'
        else:
            raise ValueError(f"Unsupported file extension: '{ext}'. Only .csv, .xls, and .xlsx are supported.")

    def load_data(self) -> pd.DataFrame:
        """
        Load data from the specified file path based on the inferred type.

        Returns:
            pd.DataFrame: A DataFrame containing the loaded data.

        Raises:
            FileNotFoundError: If the file does not exist (propagated from pandas).
            ValueError: If an internal logic error occurs regarding file type.
        """
        if self.file_type == 'excel':
            return self._load_excel()
        elif self.file_type == 'csv':
            return self._load_csv()
        else:
            raise ValueError(f"Unsupported file type: {self.file_type}")

    def _load_excel(self) -> pd.DataFrame:
        """
        Private method to load data from an Excel file.

        Returns:
            pd.DataFrame: DataFrame containing data read from Excel.
        """
        return pd.read_excel(self.file_path, **self.kwargs)

    def _load_csv(self) -> pd.DataFrame:
        """
        Private method to load data from a CSV file.

        Returns:
            pd.DataFrame: DataFrame containing data read from CSV.
        """
        return pd.read_csv(self.file_path, **self.kwargs)

__init__(file_path, **kwargs)

Initialize the DataLoader class.

Parameters:

Name       Type            Default   Description
file_path  str             required  Path to the input file.
**kwargs   Dict[str, Any]  {}        Additional arguments for pandas read functions. Common examples include sheet_name for Excel or sep for CSV.

Raises:

Type        Description
ValueError  If the file extension is not supported.

Source code in payn\DataLoader\dataloader.py
def __init__(self, file_path: str, **kwargs):
    """
    Initialize the DataLoader class.

    Args:
        file_path (str): Path to the input file.
        **kwargs (Dict[str, Any]): Additional arguments for pandas read functions.
                  Common examples include `sheet_name` for Excel or `sep` for CSV.

    Raises:
        ValueError: If the file extension is not supported.
    """
    self.file_path = file_path
    self.kwargs = kwargs
    self.file_type = self._infer_file_type()

load_data()

Load data from the specified file path based on the inferred type.

Returns:

Type       Description
DataFrame  A DataFrame containing the loaded data.

Raises:

Type               Description
FileNotFoundError  If the file does not exist (propagated from pandas).
ValueError         If an internal logic error occurs regarding file type.
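
The two failure modes documented above occur at different stages: an unsupported extension is rejected when the loader is constructed, while a missing file is reported only when load_data() runs. The sketch below shows one way a caller might tell them apart. StubLoader and try_load are hypothetical names for this example; the stub reads the file as plain text instead of calling pandas, so that the snippet runs without payn or pandas installed.

```python
import os
from typing import Any, Callable, Optional, Tuple


class StubLoader:
    """Hypothetical stand-in that raises the same exceptions at the same stages."""

    def __init__(self, file_path: str, **kwargs: Any) -> None:
        ext = os.path.splitext(file_path)[1].lower()
        if ext not in (".csv", ".xls", ".xlsx"):
            raise ValueError(f"Unsupported file extension: '{ext}'")
        self.file_path = file_path

    def load_data(self) -> str:
        # The file is opened only here, so a missing file fails at this stage.
        with open(self.file_path, newline="") as handle:
            return handle.read()


def try_load(
    loader_cls: Callable[..., Any], file_path: str, **kwargs: Any
) -> Tuple[Optional[Any], Optional[str]]:
    """Return (data, None) on success or (None, reason) on a known failure."""
    try:
        loader = loader_cls(file_path, **kwargs)  # stage 1: extension check
    except ValueError as exc:
        return None, f"unsupported format: {exc}"
    try:
        return loader.load_data(), None           # stage 2: actual file I/O
    except FileNotFoundError as exc:
        return None, f"missing file: {exc}"


print(try_load(StubLoader, "notes.json"))
print(try_load(StubLoader, "no_such_dir/no_such_file.csv"))
```

Passing DataLoader in place of StubLoader should follow the same two-stage pattern, assuming the exceptions are raised as the tables above describe.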

Source code in payn\DataLoader\dataloader.py
def load_data(self) -> pd.DataFrame:
    """
    Load data from the specified file path based on the inferred type.

    Returns:
        pd.DataFrame: A DataFrame containing the loaded data.

    Raises:
        FileNotFoundError: If the file does not exist (propagated from pandas).
        ValueError: If an internal logic error occurs regarding file type.
    """
    if self.file_type == 'excel':
        return self._load_excel()
    elif self.file_type == 'csv':
        return self._load_csv()
    else:
        raise ValueError(f"Unsupported file type: {self.file_type}")