# olivia_finder.package_manager

## Initialize a package manager

**Note:**

- Initialization based on a scraper-type datasource requires initializing the data before use.
- A CSV-type datasource already contains all the data, which can be retrieved directly.
- Loading from a persistence file implies that the file contains an object that has already been initialized or already contains data.

A Bioconductor scraping-based package manager

```python
from olivia_finder.package_manager import PackageManager

bioconductor_pm_scraper = PackageManager(
    data_sources=[                  # List of data sources
        BioconductorScraper(),
    ]
)
```

A CRAN package manager loaded from a CSV file

```python
cran_pm_csv = PackageManager(
    data_sources=[                  # List of data sources
        CSVDataSource(
            # Path to the CSV file
            "aux_data/cran_adjlist_test.csv",
            dependent_field="Project Name",
            dependency_field="Dependency Name",
            dependent_version_field="Version Number",
        )
    ]
)

# The package manager must be initialized to fill the package list with the CSV data
cran_pm_csv.initialize(show_progress=True)
```

Loading packages: 100%|██████████| 275/275 [00:00<00:00, 729.91packages/s]

A Bioconductor package manager loaded from a persistence file

```python
bioconductor_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/bioconductor_scraper.olvpm")
```

A Maven package manager loaded from the Libraries.io API

```python
maven_pm_libio = PackageManager(
    data_sources=[                  # List of data sources
        LibrariesioDataSource(platform="maven")
    ]
)
```

**For scraping-based datasources: initialize the structure with the data of the selected sources**

**Note:**

Automatically obtaining the Bioconductor package list, as mentioned above, depends on Selenium, which requires a browser (in our case Firefox) pre-installed on the system.

If you are running this notebook on a third-party Jupyter server, a browser may not be available.

As a workaround, you can pass the `package_names` parameter with an explicit list of packages so the process can continue.

```python
# bioconductor_pm_scraper.initialize(show_progress=True)
```

Note: if a list of packages is not provided, it is obtained automatically, provided that functionality is implemented in the datasource.

Initialization of the Bioconductor package manager using a package list

```python
# Read the package list
bioconductor_package_list = []
with open('../results/package_lists/bioconductor_scraped.txt', 'r') as file:
    bioconductor_package_list = file.read().splitlines()

# Initialize the package manager with the package list
bioconductor_pm_scraper.initialize(show_progress=True, package_names=bioconductor_package_list[:10])
```

Loading packages: 100%|██████████| 10/10 [00:06<00:00,  1.43packages/s]

Initialization of the PyPI package manager

```python
pypi_pm_scraper = PackageManager(
    data_sources=[                  # List of data sources
        PypiScraper(),
    ]
)

# Read the package list
pypi_package_list = []
with open('../results/package_lists/pypi_scraped.txt', 'r') as file:
    pypi_package_list = file.read().splitlines()

# Initialize the package manager
pypi_pm_scraper.initialize(show_progress=True, package_names=pypi_package_list[:10])

# Save the package manager
pypi_pm_scraper.save(path="aux_data/pypi_pm_scraper_test.olvpm")
```

Loading packages: 100%|██████████| 10/10 [00:01<00:00,  6.75packages/s]

Initialization of the npm package manager

```python
# Read the package list
npm_package_list = []
with open('../results/package_lists/npm_scraped.txt', 'r') as file:
    npm_package_list = file.read().splitlines()

npm_pm_scraper = PackageManager(
    data_sources=[                  # List of data sources
        NpmScraper(),
    ]
)

# Initialize the package manager
npm_pm_scraper.initialize(show_progress=True, package_names=npm_package_list[:10])

# Save the package manager
npm_pm_scraper.save(path="aux_data/npm_pm_scraper_test.olvpm")
```

Loading packages: 100%|██████████| 10/10 [00:02<00:00,  3.88packages/s]

And using a CSV-based package manager

```python
cran_pm_csv.initialize(show_progress=True)
```

Loading packages: 100%|██████████| 275/275 [00:00<00:00, 675.72packages/s]

## Persistence

**Save the package manager**

```python
pypi_pm_scraper.save("aux_data/pypi_scraper_pm_saved.olvpm")
```

**Load a package manager from a persistence file**

```python
from olivia_finder.package_manager import PackageManager

bioconductor_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/bioconductor_scraper.olvpm")
cran_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/cran_scraper.olvpm")
pypi_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/pypi_scraper.olvpm")
npm_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/npm_scraper.olvpm")
```
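
These `.olvpm` persistence files are opaque binaries from this document's point of view; conceptually they store an already-initialized object that can be restored without re-scraping. As a generic illustration only (the actual `.olvpm` format is not specified here, and the file name below is hypothetical), a minimal Python pickle round-trip shows the idea:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for an initialized package manager's state
state = {"packages": {"A3": ["R", "xtable"]}, "initialized": True}

# Save the object to a persistence file...
path = os.path.join(tempfile.mkdtemp(), "toy_pm.pkl")
with open(path, "wb") as f:
    pickle.dump(state, f)

# ...and load it back: no re-initialization is needed
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == state)  # True
```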

## Package manager functionalities

**List package names**

```python
bioconductor_pm_loaded.package_names()[300:320]
```

['CNVgears',
 'CONSTANd',
 'CTSV',
 'CellNOptR',
 'ChAMP',
 'ChIPseqR',
 'CiteFuse',
 'Clonality',
 'CopyNumberPlots',
 'CytoGLMM',
 'DEFormats',
 'DEScan2',
 'DEsingle',
 'DMRcaller',
 'DOSE',
 'DSS',
 'DelayedMatrixStats',
 'DirichletMultinomial',
 'EBImage',
 'EDASeq']

```python
pypi_pm_loaded.package_names()[300:320]
```

['adafruit-circuitpython-bh1750',
 'adafruit-circuitpython-ble-beacon',
 'adafruit-circuitpython-ble-eddystone',
 'adafruit-circuitpython-bluefruitspi',
 'adafruit-circuitpython-bno08x',
 'adafruit-circuitpython-circuitplayground',
 'adafruit-circuitpython-debug-i2c',
 'adafruit-circuitpython-displayio-ssd1306',
 'adafruit-circuitpython-ds18x20',
 'adafruit-circuitpython-ens160',
 'adafruit-circuitpython-fingerprint',
 'adafruit-circuitpython-gc-iot-core',
 'adafruit-circuitpython-hcsr04',
 'adafruit-circuitpython-htu31d',
 'adafruit-circuitpython-imageload',
 'adafruit-circuitpython-itertools',
 'adafruit-circuitpython-lis2mdl',
 'adafruit-circuitpython-lps2x',
 'adafruit-circuitpython-lsm9ds0',
 'adafruit-circuitpython-max31855']

Obtaining package names from the Libraries.io API is not supported

```python
maven_pm_libio.package_names()
```

[]

**Get the data as a dict using the datasource**

```python
maven_pm_libio.fetch_package("org.apache.commons:commons-lang3").to_dict()
```

{'name': 'org.apache.commons:commons-lang3',
 'version': '3.9',
 'url': 'https://repo1.maven.org/maven2/org/apache/commons/commons-lang3',
 'dependencies': [{'name': 'org.openjdk.jmh:jmh-generator-annprocess',
   'version': '1.25.2',
   'url': None,
   'dependencies': []},
  {'name': 'org.openjdk.jmh:jmh-core',
   'version': '1.25.2',
   'url': None,
   'dependencies': []},
  {'name': 'org.easymock:easymock',
   'version': '5.1.0',
   'url': None,
   'dependencies': []},
  {'name': 'org.hamcrest:hamcrest',
   'version': None,
   'url': None,
   'dependencies': []},
  {'name': 'org.junit-pioneer:junit-pioneer',
   'version': '2.0.1',
   'url': None,
   'dependencies': []},
  {'name': 'org.junit.jupiter:junit-jupiter',
   'version': '5.9.3',
   'url': None,
   'dependencies': []}]}

```python
cran_pm_csv.get_package('nmfem').to_dict()
```

{'name': 'nmfem',
 'version': '1.0.4',
 'url': None,
 'dependencies': [{'name': 'rmarkdown',
   'version': None,
   'url': None,
   'dependencies': []},
  {'name': 'testthat', 'version': None, 'url': None, 'dependencies': []},
  {'name': 'knitr', 'version': None, 'url': None, 'dependencies': []},
  {'name': 'tidyr', 'version': None, 'url': None, 'dependencies': []},
  {'name': 'mixtools', 'version': None, 'url': None, 'dependencies': []},
  {'name': 'd3heatmap', 'version': None, 'url': None, 'dependencies': []},
  {'name': 'dplyr', 'version': None, 'url': None, 'dependencies': []},
  {'name': 'plyr', 'version': None, 'url': None, 'dependencies': []},
  {'name': 'R', 'version': None, 'url': None, 'dependencies': []}]}

**Get a package from the manager's own data**

```python
cran_pm_loaded.get_package('A3')
```

<olivia_finder.package.Package at 0x7f3c3722fe20>

```python
npm_pm_loaded.get_package("react").to_dict()
```

{'name': 'react',
 'version': '18.2.0',
 'url': 'https://www.npmjs.com/package/react',
 'dependencies': [{'name': 'loose-envify',
   'version': '^1.1.0',
   'url': None,
   'dependencies': []}]}
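
The nested dicts returned by `to_dict` are easy to post-process. As an example, a small helper (hypothetical, not part of olivia_finder's API) that flattens one level of such a dict into (package, dependency) pairs, applied to the `react` data shown above:

```python
def package_dict_to_edges(pkg):
    """Flatten a package dict into (package name, dependency name) pairs."""
    return [(pkg["name"], dep["name"]) for dep in pkg["dependencies"]]

# The dict returned for 'react' above
react = {
    "name": "react",
    "version": "18.2.0",
    "url": "https://www.npmjs.com/package/react",
    "dependencies": [
        {"name": "loose-envify", "version": "^1.1.0", "url": None, "dependencies": []}
    ],
}

print(package_dict_to_edges(react))  # [('react', 'loose-envify')]
```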

**List package objects**

```python
len(npm_pm_loaded.package_names())
```

1919072

```python
pypi_pm_loaded.get_packages()[300:320]
```

[<olivia_finder.package.Package at 0x7f3c58ea7ac0>,
 <olivia_finder.package.Package at 0x7f3c58ea7be0>,
 <olivia_finder.package.Package at 0x7f3c58ea7d00>,
 <olivia_finder.package.Package at 0x7f3c58ea7e20>,
 <olivia_finder.package.Package at 0x7f3c58ea7f40>,
 <olivia_finder.package.Package at 0x7f3c590e80a0>,
 <olivia_finder.package.Package at 0x7f3c590e81c0>,
 <olivia_finder.package.Package at 0x7f3c590e82e0>,
 <olivia_finder.package.Package at 0x7f3c590e83a0>,
 <olivia_finder.package.Package at 0x7f3c590e8520>,
 <olivia_finder.package.Package at 0x7f3c590e86a0>,
 <olivia_finder.package.Package at 0x7f3c590e87c0>,
 <olivia_finder.package.Package at 0x7f3c590e88e0>,
 <olivia_finder.package.Package at 0x7f3c590e89a0>,
 <olivia_finder.package.Package at 0x7f3c590e8b20>,
 <olivia_finder.package.Package at 0x7f3c590e8c40>,
 <olivia_finder.package.Package at 0x7f3c590e8d00>,
 <olivia_finder.package.Package at 0x7f3c590e8e80>,
 <olivia_finder.package.Package at 0x7f3c590e8fa0>,
 <olivia_finder.package.Package at 0x7f3c590e90c0>]

## Obtain dependency networks

Using the data previously obtained, which is already loaded in the structure:

```python
a4_network = bioconductor_pm_loaded.fetch_adjlist("a4")
a4_network
```

{'a4': ['a4Base', 'a4Preproc', 'a4Classif', 'a4Core', 'a4Reporting'],
 'a4Base': ['a4Preproc',
  'a4Core',
  'methods',
  'graphics',
  'grid',
  'Biobase',
  'annaffy',
  'mpm',
  'genefilter',
  'limma',
  'multtest',
  'glmnet',
  'gplots'],
 'a4Preproc': ['BiocGenerics', 'Biobase'],
 'BiocGenerics': ['R', 'methods', 'utils', 'graphics', 'stats'],
 'R': [],
 'methods': [],
 'utils': [],
 'graphics': [],
 'stats': [],
 'Biobase': ['R', 'BiocGenerics', 'utils', 'methods'],
 'a4Core': ['Biobase', 'glmnet', 'methods', 'stats'],
 'glmnet': [],
 'grid': [],
 'annaffy': ['R',
  'methods',
  'Biobase',
  'BiocManager',
  'GO.db',
  'AnnotationDbi',
  'DBI'],
 'BiocManager': [],
 'GO.db': [],
 'AnnotationDbi': ['R',
  'methods',
  'stats4',
  'BiocGenerics',
  'Biobase',
  'IRanges',
  'DBI',
  'RSQLite',
  'S4Vectors',
  'stats',
  'KEGGREST'],
 'stats4': [],
 'IRanges': ['R',
  'methods',
  'utils',
  'stats',
  'BiocGenerics',
  'S4Vectors',
  'stats4'],
 'DBI': [],
 'RSQLite': [],
 'S4Vectors': ['R', 'methods', 'utils', 'stats', 'stats4', 'BiocGenerics'],
 'KEGGREST': ['R', 'methods', 'httr', 'png', 'Biostrings'],
 'mpm': [],
 'genefilter': ['MatrixGenerics',
  'AnnotationDbi',
  'annotate',
  'Biobase',
  'graphics',
  'methods',
  'stats',
  'survival',
  'grDevices'],
 'MatrixGenerics': ['matrixStats', 'methods'],
 'matrixStats': [],
 'annotate': ['R',
  'AnnotationDbi',
  'XML',
  'Biobase',
  'DBI',
  'xtable',
  'graphics',
  'utils',
  'stats',
  'methods',
  'BiocGenerics',
  'httr'],
 'XML': [],
 'xtable': [],
 'httr': [],
 'survival': [],
 'grDevices': [],
 'limma': ['R', 'grDevices', 'graphics', 'stats', 'utils', 'methods'],
 'multtest': ['R',
  'methods',
  'BiocGenerics',
  'Biobase',
  'survival',
  'MASS',
  'stats4'],
 'MASS': [],
 'gplots': [],
 'a4Classif': ['a4Core',
  'a4Preproc',
  'methods',
  'Biobase',
  'ROCR',
  'pamr',
  'glmnet',
  'varSelRF',
  'utils',
  'graphics',
  'stats'],
 'ROCR': [],
 'pamr': [],
 'varSelRF': [],
 'a4Reporting': ['methods', 'xtable']}

**Get transitive dependency network graph**

```python
commons_lang3_network = maven_pm_libio.get_transitive_network_graph("org.apache.commons:commons-lang3", generate=True)
commons_lang3_network
```

<networkx.classes.digraph.DiGraph at 0x7f3c67ee3d90>

```python
# Draw the network
import networkx as nx
import matplotlib.pyplot as plt
from matplotlib import patches

pos = nx.spring_layout(commons_lang3_network)
nx.draw(commons_lang3_network, pos, node_size=50, font_size=8)

# Highlight the root package in red
nx.draw_networkx_nodes(commons_lang3_network, pos, nodelist=["org.apache.commons:commons-lang3"], node_size=100, node_color="r")
plt.title("org.apache.commons:commons-lang3 transitive network", fontsize=15)
# Add a legend for the red node
red_patch = patches.Patch(color='red', label='org.apache.commons:commons-lang3')
plt.legend(handles=[red_patch])
plt.show()
```

*(figure: org.apache.commons:commons-lang3 transitive dependency network)*

**Obtaining updated data**

```python
a4_network2 = bioconductor_pm_loaded.get_adjlist("a4")
a4_network2
```

{'a4': ['a4Base', 'a4Preproc', 'a4Classif', 'a4Core', 'a4Reporting'],
 'a4Base': ['a4Preproc',
  'a4Core',
  'methods',
  'graphics',
  'grid',
  'Biobase',
  'annaffy',
  'mpm',
  'genefilter',
  'limma',
  'multtest',
  'glmnet',
  'gplots'],
 'a4Preproc': ['BiocGenerics', 'Biobase'],
 'BiocGenerics': ['R', 'methods', 'utils', 'graphics', 'stats'],
 'Biobase': ['R', 'BiocGenerics', 'utils', 'methods'],
 'a4Core': ['Biobase', 'glmnet', 'methods', 'stats'],
 'annaffy': ['R',
  'methods',
  'Biobase',
  'BiocManager',
  'GO.db',
  'AnnotationDbi',
  'DBI'],
 'AnnotationDbi': ['R',
  'methods',
  'utils',
  'stats4',
  'BiocGenerics',
  'Biobase',
  'IRanges',
  'DBI',
  'RSQLite',
  'S4Vectors',
  'stats',
  'KEGGREST'],
 'IRanges': ['R',
  'methods',
  'utils',
  'stats',
  'BiocGenerics',
  'S4Vectors',
  'stats4'],
 'S4Vectors': ['R', 'methods', 'utils', 'stats', 'stats4', 'BiocGenerics'],
 'KEGGREST': ['R', 'methods', 'httr', 'png', 'Biostrings'],
 'genefilter': ['MatrixGenerics',
  'AnnotationDbi',
  'annotate',
  'Biobase',
  'graphics',
  'methods',
  'stats',
  'survival',
  'grDevices'],
 'MatrixGenerics': ['matrixStats', 'methods'],
 'annotate': ['R',
  'AnnotationDbi',
  'XML',
  'Biobase',
  'DBI',
  'xtable',
  'graphics',
  'utils',
  'stats',
  'methods',
  'BiocGenerics',
  'httr'],
 'limma': ['R', 'grDevices', 'graphics', 'stats', 'utils', 'methods'],
 'multtest': ['R',
  'methods',
  'BiocGenerics',
  'Biobase',
  'survival',
  'MASS',
  'stats4'],
 'a4Classif': ['a4Core',
  'a4Preproc',
  'methods',
  'Biobase',
  'ROCR',
  'pamr',
  'glmnet',
  'varSelRF',
  'utils',
  'graphics',
  'stats'],
 'a4Reporting': ['methods', 'xtable']}

Note that some package managers use dependencies that are not found in their own repositories, as is the case of the 'xtable' package: although it is not in Bioconductor, it is a dependency of a Bioconductor package.

```python
xtable_bioconductor = bioconductor_pm_scraper.fetch_package("xtable")
xtable_bioconductor
```

Specifically, this package is in CRAN:

```python
cran_pm = PackageManager(
    data_sources=[                  # List of data sources
        CranScraper(),
    ]
)

cran_pm.fetch_package("xtable")
```

<olivia_finder.package.Package at 0x7f3c2a19d090>

To resolve this inconsistency, we can supply the package manager with the CRAN datasource as an auxiliary datasource, used to perform searches when data is not found in the main datasource.

```python
bioconductor_cran_pm = PackageManager(
    data_sources=[                                          # List of data sources
        BioconductorScraper(),
        CranScraper(),
    ]
)

bioconductor_cran_pm.fetch_package("xtable")
```

<olivia_finder.package.Package at 0x7f3c2a19c910>

In this way we can obtain a package's dependency network recursively, now with access to packages and dependencies from the CRAN repository.

```python
a4_network3 = bioconductor_cran_pm.get_adjlist("a4")
a4_network3
```

{'a4': []}

Note that `get_adjlist` resolves dependencies only from the data already loaded in the structure; since `bioconductor_cran_pm` has not been initialized, only the root package is returned. To actually obtain a more complete network from combined datasources, the data must be fetched, as shown later in *Explore the data*.

Combining datasources requires that they be compatible, as in the Bioconductor/CRAN case.

```python
a4_network.keys() == a4_network2.keys()
```

False

```python
print(len(a4_network.keys()))
print(len(a4_network2.keys()))
print(len(a4_network3.keys()))
```

42
18
1
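
The raw key counts above can be made more informative by comparing the key sets themselves. A sketch with toy adjacency lists (hypothetical stand-ins for `a4_network` and `a4_network2`):

```python
# Toy stand-ins: a fetched network vs. one resolved from stored data only
fetched = {"a4": ["a4Core", "xtable"], "a4Core": [], "xtable": []}
stored = {"a4": ["a4Core"], "a4Core": []}

# Packages that only the richer source could resolve
missing = fetched.keys() - stored.keys()
print(sorted(missing))  # ['xtable']
```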

## Export the data

```python
bioconductor_df = bioconductor_pm_loaded.export_dataframe(full_data=False)

# Export the dataframe to a csv file
bioconductor_df.to_csv("aux_data/bioconductor_adjlist_scraping.csv", index=False)
bioconductor_df
```

|   | name | dependency |
|---|------|------------|
| 0 | ABSSeq | R |
| 1 | ABSSeq | methods |
| 2 | ABSSeq | locfit |
| 3 | ABSSeq | limma |
| 4 | AMOUNTAIN | R |
| ... | ... | ... |
| 28322 | zenith | reshape2 |
| 28323 | zenith | progress |
| 28324 | zenith | utils |
| 28325 | zenith | Rdpack |
| 28326 | zenith | stats |

28327 rows × 2 columns

```python
pypi_df = pypi_pm_loaded.export_dataframe(full_data=True)
pypi_df
```

|   | name | version | url | dependency | dependency_version | dependency_url |
|---|------|---------|-----|------------|--------------------|----------------|
| 0 | 0x-sra-client | 4.0.0 | https://pypi.org/project/0x-sra-client/ | urllib3 | 2.0.2 | https://pypi.org/project/urllib3/ |
| 1 | 0x-sra-client | 4.0.0 | https://pypi.org/project/0x-sra-client/ | six | 1.16.0 | https://pypi.org/project/six/ |
| 2 | 0x-sra-client | 4.0.0 | https://pypi.org/project/0x-sra-client/ | certifi | 2022.12.7 | https://pypi.org/project/certifi/ |
| 3 | 0x-sra-client | 4.0.0 | https://pypi.org/project/0x-sra-client/ | python | None | None |
| 4 | 0x-sra-client | 4.0.0 | https://pypi.org/project/0x-sra-client/ | 0x | 0.1 | https://pypi.org/project/0x/ |
| ... | ... | ... | ... | ... | ... | ... |
| 933950 | zyfra-check | 0.0.9 | https://pypi.org/project/zyfra-check/ | pytest | 7.3.1 | https://pypi.org/project/pytest/ |
| 933951 | zyfra-check | 0.0.9 | https://pypi.org/project/zyfra-check/ | jira | 3.5.0 | https://pypi.org/project/jira/ |
| 933952 | zyfra-check | 0.0.9 | https://pypi.org/project/zyfra-check/ | testit | None | None |
| 933953 | zython | 0.4.1 | https://pypi.org/project/zython/ | wheel | 0.40.0 | https://pypi.org/project/wheel/ |
| 933954 | zython | 0.4.1 | https://pypi.org/project/zython/ | minizinc | 0.9.0 | https://pypi.org/project/minizinc/ |

933955 rows × 6 columns

```python
npm_df = npm_pm_loaded.export_dataframe(full_data=True)
npm_df
```

|   | name | version | url | dependency | dependency_version | dependency_url |
|---|------|---------|-----|------------|--------------------|----------------|
| 0 | --hoodmane-test-pyodide | 0.21.0 | https://www.npmjs.com/package/--hoodmane-test-... | base-64 | 1.0.0 | https://www.npmjs.com/package/base-64 |
| 1 | --hoodmane-test-pyodide | 0.21.0 | https://www.npmjs.com/package/--hoodmane-test-... | node-fetch | 3.3.1 | https://www.npmjs.com/package/node-fetch |
| 2 | --hoodmane-test-pyodide | 0.21.0 | https://www.npmjs.com/package/--hoodmane-test-... | ws | 8.13.0 | https://www.npmjs.com/package/ws |
| 3 | -lidonghui | 1.0.0 | https://www.npmjs.com/package/-lidonghui | axios | 1.4.0 | https://www.npmjs.com/package/axios |
| 4 | -lidonghui | 1.0.0 | https://www.npmjs.com/package/-lidonghui | commander | 10.0.1 | https://www.npmjs.com/package/commander |
| ... | ... | ... | ... | ... | ... | ... |
| 4855089 | zzzzz-first-module | 1.0.0 | https://www.npmjs.com/package/zzzzz-first-module | rxjs | 7.8.1 | https://www.npmjs.com/package/rxjs |
| 4855090 | zzzzz-first-module | 1.0.0 | https://www.npmjs.com/package/zzzzz-first-module | zone.js | 0.13.0 | https://www.npmjs.com/package/zone.js |
| 4855091 | zzzzzwszzzz | 1.0.0 | https://www.npmjs.com/package/zzzzzwszzzz | commander | 10.0.1 | https://www.npmjs.com/package/commander |
| 4855092 | zzzzzwszzzz | 1.0.0 | https://www.npmjs.com/package/zzzzzwszzzz | inquirer | 9.2.2 | https://www.npmjs.com/package/inquirer |
| 4855093 | zzzzzwszzzz | 1.0.0 | https://www.npmjs.com/package/zzzzzwszzzz | link | 1.5.1 | https://www.npmjs.com/package/link |

4855094 rows × 6 columns
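
The exported dataframes are edge lists of (name, dependency) rows; they can be regrouped into the adjacency-list form used earlier. A pure-Python sketch (the tuples here are a toy sample of the rows above; a real dataframe would supply them via `itertuples`):

```python
from collections import defaultdict

# A few (name, dependency) edges, as in the exported dataframes
edges = [
    ("ABSSeq", "R"),
    ("ABSSeq", "methods"),
    ("AMOUNTAIN", "R"),
]

# Group edges by dependent package to rebuild an adjacency list
adjlist = defaultdict(list)
for name, dependency in edges:
    adjlist[name].append(dependency)

print(dict(adjlist))  # {'ABSSeq': ['R', 'methods'], 'AMOUNTAIN': ['R']}
```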

## Get network graph

```python
bioconductor_G = bioconductor_pm_loaded.get_network_graph()
bioconductor_G
```

<networkx.classes.digraph.DiGraph at 0x7f3c229451b0>

```python
# Draw the graph
# --------------
# Note: execution can take a while
import networkx as nx
import matplotlib.pyplot as plt

pos = nx.spring_layout(bioconductor_G)
plt.figure(figsize=(10, 10))
nx.draw_networkx_nodes(bioconductor_G, pos, node_size=10, node_color="blue")
nx.draw_networkx_edges(bioconductor_G, pos, alpha=0.4, edge_color="black", width=0.1)
plt.title("Bioconductor network graph", fontsize=15)
plt.show()
```

## Explore the data

We can appreciate the difference explained above when we use a combined datasource:

```python
import json

bioconductor_cran_pm = PackageManager(
    data_sources=[BioconductorScraper(), CranScraper()]
)

a4_network_2 = bioconductor_cran_pm.fetch_adjlist("a4")
print(json.dumps(a4_network_2, indent=4))
```

{
    "a4": [
        "a4Base",
        "a4Preproc",
        "a4Classif",
        "a4Core",
        "a4Reporting"
    ],
    "a4Base": [
        "a4Preproc",
        "a4Core",
        "methods",
        "graphics",
        "grid",
        "Biobase",
        "annaffy",
        "mpm",
        "genefilter",
        "limma",
        "multtest",
        "glmnet",
        "gplots"
    ],
    "a4Preproc": [
        "BiocGenerics",
        "Biobase"
    ],
    "BiocGenerics": [
        "R",
        "methods",
        "utils",
        "graphics",
        "stats"
    ],
    "R": [],
    "methods": [],
    "utils": [],
    "graphics": [],
    "stats": [],
    "Biobase": [
        "R",
        "BiocGenerics",
        "utils",
        "methods"
    ],
    "a4Core": [
        "Biobase",
        "glmnet",
        "methods",
        "stats"
    ],
    "glmnet": [
        "R",
        "Matrix",
        "methods",
        "utils",
        "foreach",
        "shape",
        "survival",
        "Rcpp"
    ],
    "Matrix": [
        "R",
        "methods",
        "graphics",
        "grid",
        "lattice",
        "stats",
        "utils"
    ],
    "foreach": [
        "R",
        "codetools",
        "utils",
        "iterators"
    ],
    "shape": [
        "R",
        "stats",
        "graphics",
        "grDevices"
    ],
    "survival": [
        "R",
        "graphics",
        "Matrix",
        "methods",
        "splines",
        "stats",
        "utils"
    ],
    "Rcpp": [
        "methods",
        "utils"
    ],
    "grid": [],
    "annaffy": [
        "R",
        "methods",
        "Biobase",
        "BiocManager",
        "GO.db",
        "AnnotationDbi",
        "DBI"
    ],
    "BiocManager": [
        "utils"
    ],
    "GO.db": [],
    "AnnotationDbi": [
        "R",
        "methods",
        "stats4",
        "BiocGenerics",
        "Biobase",
        "IRanges",
        "DBI",
        "RSQLite",
        "S4Vectors",
        "stats",
        "KEGGREST"
    ],
    "stats4": [],
    "IRanges": [
        "R",
        "methods",
        "utils",
        "stats",
        "BiocGenerics",
        "S4Vectors",
        "stats4"
    ],
    "DBI": [
        "methods",
        "R"
    ],
    "RSQLite": [
        "R",
        "bit64",
        "blob",
        "DBI",
        "memoise",
        "methods",
        "pkgconfig"
    ],
    "S4Vectors": [
        "R",
        "methods",
        "utils",
        "stats",
        "stats4",
        "BiocGenerics"
    ],
    "KEGGREST": [
        "R",
        "methods",
        "httr",
        "png",
        "Biostrings"
    ],
    "mpm": [
        "R",
        "MASS",
        "KernSmooth"
    ],
    "MASS": [
        "R",
        "grDevices",
        "graphics",
        "stats",
        "utils",
        "methods"
    ],
    "grDevices": [],
    "KernSmooth": [
        "R",
        "stats"
    ],
    "genefilter": [
        "MatrixGenerics",
        "AnnotationDbi",
        "annotate",
        "Biobase",
        "graphics",
        "methods",
        "stats",
        "survival",
        "grDevices"
    ],
    "MatrixGenerics": [
        "matrixStats",
        "methods"
    ],
    "matrixStats": [
        "R"
    ],
    "annotate": [
        "R",
        "AnnotationDbi",
        "XML",
        "Biobase",
        "DBI",
        "xtable",
        "graphics",
        "utils",
        "stats",
        "methods",
        "BiocGenerics",
        "httr"
    ],
    "XML": [
        "R",
        "methods",
        "utils"
    ],
    "xtable": [
        "R",
        "stats",
        "utils"
    ],
    "httr": [
        "R",
        "curl",
        "jsonlite",
        "mime",
        "openssl",
        "R6"
    ],
    "limma": [
        "R",
        "grDevices",
        "graphics",
        "stats",
        "utils",
        "methods"
    ],
    "multtest": [
        "R",
        "methods",
        "BiocGenerics",
        "Biobase",
        "survival",
        "MASS",
        "stats4"
    ],
    "gplots": [
        "R",
        "gtools",
        "stats",
        "caTools",
        "KernSmooth",
        "methods"
    ],
    "gtools": [
        "methods",
        "stats",
        "utils"
    ],
    "caTools": [
        "R",
        "bitops"
    ],
    "bitops": [],
    "a4Classif": [
        "a4Core",
        "a4Preproc",
        "methods",
        "Biobase",
        "ROCR",
        "pamr",
        "glmnet",
        "varSelRF",
        "utils",
        "graphics",
        "stats"
    ],
    "ROCR": [
        "R",
        "methods",
        "graphics",
        "grDevices",
        "gplots",
        "stats"
    ],
    "pamr": [
        "R",
        "cluster",
        "survival"
    ],
    "cluster": [
        "R",
        "graphics",
        "grDevices",
        "stats",
        "utils"
    ],
    "varSelRF": [
        "R",
        "randomForest",
        "parallel"
    ],
    "randomForest": [
        "R",
        "stats"
    ],
    "parallel": [],
    "a4Reporting": [
        "methods",
        "xtable"
    ]
}
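
Adjacency lists like the one above can be traversed to measure the size of a package's transitive dependency set. A breadth-first sketch over a toy graph with the same shape as the `a4` output:

```python
from collections import deque

def transitive_deps(adjlist, root):
    """Collect every package reachable from root (root itself excluded)."""
    seen, queue = set(), deque([root])
    while queue:
        for dep in adjlist.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen - {root}

toy = {"a4": ["a4Core"], "a4Core": ["Biobase", "glmnet"], "Biobase": []}
print(sorted(transitive_deps(toy, "a4")))  # ['Biobase', 'a4Core', 'glmnet']
```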
   1'''
   2
   3## Initialize a package manager
   4
   5
   6**Note:**
   7
   8Initialization based on a scraper-type datasource involves initializing the data prior to its use.
   9
  10Initialization based on a CSV-type datasource already contains all the data and can be retrieved directly.
  11
  12Loading from a persistence file implies that the file contains an object that has already been initialized or already contains data.
  13
  14A bioconductor scraping based package manager
  15
  16
  17```python
  18from olivia_finder.package_manager import PackageManager
  19```
  20
  21
  22```python
  23bioconductor_pm_scraper = PackageManager(
  24    data_sources=[                  # List of data sources
  25        BioconductorScraper(),
  26    ]
  27)
  28```
  29
  30A cran package manager loaded from a csv file
  31
  32
  33```python
  34cran_pm_csv = PackageManager(
  35    data_sources=[                  # List of data sources
  36        CSVDataSource(
  37            # Path to the CSV file
  38            "aux_data/cran_adjlist_test.csv",
  39            dependent_field="Project Name",
  40            dependency_field="Dependency Name",
  41            dependent_version_field="Version Number",
  42        )
  43    ]
  44)
  45
  46# Is needed to initialize the package manager to fill the package list with the csv data
  47cran_pm_csv.initialize(show_progress=True)
  48```
  49
  50    Loading packages: 100%|██████████| 275/275 [00:00<00:00, 729.91packages/s]
  51
  52
  53A pypi package manager loaded from persistence file
  54
  55
  56```python
  57bioconductor_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/bioconductor_scraper.olvpm")
  58```
  59
  60A Maven package manager loaded from librariesio api
  61
  62
  63```python
  64maven_pm_libio = PackageManager(
  65    data_sources=[                  # List of data sources
  66        LibrariesioDataSource(platform="maven")
  67    ]
  68)
  69```
  70
  71**For scraping-based datasources: Initialize the structure with the data of the selected sources**
  72
  73
  74<span style="color:red">Note:</span>
  75
  76The automatic obtaining of bioconductor packages as mentioned above depends on Selenium, which requires a pre-installed browser in the system, in our case Firefox.
  77
  78It is possible that if you are running this notebook from a third-party Jupyter server, do not have a browser available
  79
  80As a solution to this problem it is proposed to use the package_names parameter, in this way we can add a list of packages and the process can be continued
  81
  82
  83```python
  84# bioconductor_pm_scraper.initialize(show_progress=True)
  85```
  86
  87Note: If we do not provide a list of packages it will be obtained automatically if that functionality is implemented in datasource
  88
  89Initialization of the bioconductor package manager using package list
  90
  91
  92```python
  93# Initialize the package list
  94bioconductor_package_list = []
  95with open('../results/package_lists/bioconductor_scraped.txt', 'r') as file:
  96    bioconductor_package_list = file.read().splitlines()
  97
  98# Initialize the package manager with the package list
  99bioconductor_pm_scraper.initialize(show_progress=True, package_names=bioconductor_package_list[:10])
 100```
 101
 102    Loading packages: 100%|██████████| 10/10 [00:06<00:00,  1.43packages/s]
 103
 104
 105Initialization of the Pypi package manager
 106
 107
 108
 109```python
 110pypi_pm_scraper = PackageManager(
 111    data_sources=[                  # List of data sources
 112        PypiScraper(),
 113    ]
 114)
 115
 116pypi_package_list = []
 117with open('../results/package_lists/pypi_scraped.txt', 'r') as file:
 118    pypi_package_list = file.read().splitlines()
 119
 120# Initialize the package manager
 121pypi_pm_scraper.initialize(show_progress=True, package_names=pypi_package_list[:10])
 122
 123# Save the package manager
 124pypi_pm_scraper.save(path="aux_data/pypi_pm_scraper_test.olvpm")
 125```
 126
 127    Loading packages: 100%|██████████| 10/10 [00:01<00:00,  6.75packages/s]
 128
 129
 130Initialization of the npm package manager
 131
 132
 133```python
 134# Initialize the package manager
 135npm_package_list = []
 136with open('../results/package_lists/npm_scraped.txt', 'r') as file:
 137    npm_package_list = file.read().splitlines()
 138
 139npm_pm_scraper = PackageManager(
 140    data_sources=[                  # List of data sources
 141        NpmScraper(),
 142    ]
 143)
 144
 145# Initialize the package manager
 146npm_pm_scraper.initialize(show_progress=True, package_names=npm_package_list[:10])
 147
 148# Save the package manager
 149npm_pm_scraper.save(path="aux_data/npm_pm_scraper_test.olvpm")
 150```
 151
 152    Loading packages: 100%|██████████| 10/10 [00:02<00:00,  3.88packages/s]
 153
 154
 155And using a csv based package manager
 156
 157
 158```python
 159cran_pm_csv.initialize(show_progress=True)
 160```
 161
 162    Loading packages: 100%|██████████| 275/275 [00:00<00:00, 675.72packages/s]
 163
 164
 165## Persistence
 166
 167
 168**Save the package manager**
 169
 170
 171
 172```python
 173pypi_pm_scraper.save("aux_data/pypi_scraper_pm_saved.olvpm")
 174```
 175
 176**Load package manager from persistence file**
 177
 178
 179
 180```python
 181from olivia_finder.package_manager import PackageManager
 182```
 183
 184
 185```python
 186bioconductor_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/bioconductor_scraper.olvpm")
 187```
 188
 189
 190```python
 191cran_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/cran_scraper.olvpm")
 192```
 193
 194
 195```python
 196pypi_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/pypi_scraper.olvpm")
 197```
 198
 199
 200```python
 201npm_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/npm_scraper.olvpm")
 202```
 203
 204## Package manager functionalities
 205
 206
 207**List package names**
 208
 209
 210
 211```python
 212bioconductor_pm_loaded.package_names()[300:320]
 213```
 214
 215
 216    ['CNVgears',
 217     'CONSTANd',
 218     'CTSV',
 219     'CellNOptR',
 220     'ChAMP',
 221     'ChIPseqR',
 222     'CiteFuse',
 223     'Clonality',
 224     'CopyNumberPlots',
 225     'CytoGLMM',
 226     'DEFormats',
 227     'DEScan2',
 228     'DEsingle',
 229     'DMRcaller',
 230     'DOSE',
 231     'DSS',
 232     'DelayedMatrixStats',
 233     'DirichletMultinomial',
 234     'EBImage',
 235     'EDASeq']
 236
 237
 238
 239
 240```python
 241pypi_pm_loaded.package_names()[300:320]
 242```
 243
 244
 245
 246
 247    ['adafruit-circuitpython-bh1750',
 248     'adafruit-circuitpython-ble-beacon',
 249     'adafruit-circuitpython-ble-eddystone',
 250     'adafruit-circuitpython-bluefruitspi',
 251     'adafruit-circuitpython-bno08x',
 252     'adafruit-circuitpython-circuitplayground',
 253     'adafruit-circuitpython-debug-i2c',
 254     'adafruit-circuitpython-displayio-ssd1306',
 255     'adafruit-circuitpython-ds18x20',
 256     'adafruit-circuitpython-ens160',
 257     'adafruit-circuitpython-fingerprint',
 258     'adafruit-circuitpython-gc-iot-core',
 259     'adafruit-circuitpython-hcsr04',
 260     'adafruit-circuitpython-htu31d',
 261     'adafruit-circuitpython-imageload',
 262     'adafruit-circuitpython-itertools',
 263     'adafruit-circuitpython-lis2mdl',
 264     'adafruit-circuitpython-lps2x',
 265     'adafruit-circuitpython-lsm9ds0',
 266     'adafruit-circuitpython-max31855']
 267
 268
 269
<span style="color: red"> Obtaining package names from the Libraries.io API is not supported</span>
 271
 272
 273```python
 274maven_pm_libio.package_names()
 275```
 276
 277
 278
 279
 280    []
 281
 282
 283
**Get the data as a dict using the data source**
 285
 286
 287```python
 288maven_pm_libio.fetch_package("org.apache.commons:commons-lang3").to_dict()
 289```
 290
 291
 292
 293
 294    {'name': 'org.apache.commons:commons-lang3',
 295     'version': '3.9',
 296     'url': 'https://repo1.maven.org/maven2/org/apache/commons/commons-lang3',
 297     'dependencies': [{'name': 'org.openjdk.jmh:jmh-generator-annprocess',
 298       'version': '1.25.2',
 299       'url': None,
 300       'dependencies': []},
 301      {'name': 'org.openjdk.jmh:jmh-core',
 302       'version': '1.25.2',
 303       'url': None,
 304       'dependencies': []},
 305      {'name': 'org.easymock:easymock',
 306       'version': '5.1.0',
 307       'url': None,
 308       'dependencies': []},
 309      {'name': 'org.hamcrest:hamcrest',
 310       'version': None,
 311       'url': None,
 312       'dependencies': []},
 313      {'name': 'org.junit-pioneer:junit-pioneer',
 314       'version': '2.0.1',
 315       'url': None,
 316       'dependencies': []},
 317      {'name': 'org.junit.jupiter:junit-jupiter',
 318       'version': '5.9.3',
 319       'url': None,
 320       'dependencies': []}]}
 321
 322
 323
 324
 325```python
 326cran_pm_csv.get_package('nmfem').to_dict()
 327```
 328
 329
 330
 331
 332    {'name': 'nmfem',
 333     'version': '1.0.4',
 334     'url': None,
 335     'dependencies': [{'name': 'rmarkdown',
 336       'version': None,
 337       'url': None,
 338       'dependencies': []},
 339      {'name': 'testthat', 'version': None, 'url': None, 'dependencies': []},
 340      {'name': 'knitr', 'version': None, 'url': None, 'dependencies': []},
 341      {'name': 'tidyr', 'version': None, 'url': None, 'dependencies': []},
 342      {'name': 'mixtools', 'version': None, 'url': None, 'dependencies': []},
 343      {'name': 'd3heatmap', 'version': None, 'url': None, 'dependencies': []},
 344      {'name': 'dplyr', 'version': None, 'url': None, 'dependencies': []},
 345      {'name': 'plyr', 'version': None, 'url': None, 'dependencies': []},
 346      {'name': 'R', 'version': None, 'url': None, 'dependencies': []}]}
 347
 348
 349
**Get a package from the package manager's own data**
 351
 352
 353```python
 354cran_pm_loaded.get_package('A3')
 355```
 356
 357
 358
 359
 360    <olivia_finder.package.Package at 0x7f3c3722fe20>
 361
 362
 363
 364
 365```python
 366npm_pm_loaded.get_package("react").to_dict()
 367```
 368
 369
 370
 371
 372    {'name': 'react',
 373     'version': '18.2.0',
 374     'url': 'https://www.npmjs.com/package/react',
 375     'dependencies': [{'name': 'loose-envify',
 376       'version': '^1.1.0',
 377       'url': None,
 378       'dependencies': []}]}
 379
 380
 381
 382**List package objects**
 383
 384
 385
 386```python
 387len(npm_pm_loaded.package_names())
 388```
 389
 390
 391
 392
 393    1919072
 394
 395
 396
 397
 398```python
 399pypi_pm_loaded.get_packages()[300:320]
 400```
 401
 402
 403
 404
 405    [<olivia_finder.package.Package at 0x7f3c58ea7ac0>,
 406     <olivia_finder.package.Package at 0x7f3c58ea7be0>,
 407     <olivia_finder.package.Package at 0x7f3c58ea7d00>,
 408     <olivia_finder.package.Package at 0x7f3c58ea7e20>,
 409     <olivia_finder.package.Package at 0x7f3c58ea7f40>,
 410     <olivia_finder.package.Package at 0x7f3c590e80a0>,
 411     <olivia_finder.package.Package at 0x7f3c590e81c0>,
 412     <olivia_finder.package.Package at 0x7f3c590e82e0>,
 413     <olivia_finder.package.Package at 0x7f3c590e83a0>,
 414     <olivia_finder.package.Package at 0x7f3c590e8520>,
 415     <olivia_finder.package.Package at 0x7f3c590e86a0>,
 416     <olivia_finder.package.Package at 0x7f3c590e87c0>,
 417     <olivia_finder.package.Package at 0x7f3c590e88e0>,
 418     <olivia_finder.package.Package at 0x7f3c590e89a0>,
 419     <olivia_finder.package.Package at 0x7f3c590e8b20>,
 420     <olivia_finder.package.Package at 0x7f3c590e8c40>,
 421     <olivia_finder.package.Package at 0x7f3c590e8d00>,
 422     <olivia_finder.package.Package at 0x7f3c590e8e80>,
 423     <olivia_finder.package.Package at 0x7f3c590e8fa0>,
 424     <olivia_finder.package.Package at 0x7f3c590e90c0>]
 425
 426
 427
 428**Obtain dependency networks**
 429
Using previously obtained data that is already loaded in the structure
 431
 432
 433```python
 434a4_network = bioconductor_pm_loaded.fetch_adjlist("a4")
 435a4_network
 436```
 437
 438
 439
 440
 441    {'a4': ['a4Base', 'a4Preproc', 'a4Classif', 'a4Core', 'a4Reporting'],
 442     'a4Base': ['a4Preproc',
 443      'a4Core',
 444      'methods',
 445      'graphics',
 446      'grid',
 447      'Biobase',
 448      'annaffy',
 449      'mpm',
 450      'genefilter',
 451      'limma',
 452      'multtest',
 453      'glmnet',
 454      'gplots'],
 455     'a4Preproc': ['BiocGenerics', 'Biobase'],
 456     'BiocGenerics': ['R', 'methods', 'utils', 'graphics', 'stats'],
 457     'R': [],
 458     'methods': [],
 459     'utils': [],
 460     'graphics': [],
 461     'stats': [],
 462     'Biobase': ['R', 'BiocGenerics', 'utils', 'methods'],
 463     'a4Core': ['Biobase', 'glmnet', 'methods', 'stats'],
 464     'glmnet': [],
 465     'grid': [],
 466     'annaffy': ['R',
 467      'methods',
 468      'Biobase',
 469      'BiocManager',
 470      'GO.db',
 471      'AnnotationDbi',
 472      'DBI'],
 473     'BiocManager': [],
 474     'GO.db': [],
 475     'AnnotationDbi': ['R',
 476      'methods',
 477      'stats4',
 478      'BiocGenerics',
 479      'Biobase',
 480      'IRanges',
 481      'DBI',
 482      'RSQLite',
 483      'S4Vectors',
 484      'stats',
 485      'KEGGREST'],
 486     'stats4': [],
 487     'IRanges': ['R',
 488      'methods',
 489      'utils',
 490      'stats',
 491      'BiocGenerics',
 492      'S4Vectors',
 493      'stats4'],
 494     'DBI': [],
 495     'RSQLite': [],
 496     'S4Vectors': ['R', 'methods', 'utils', 'stats', 'stats4', 'BiocGenerics'],
 497     'KEGGREST': ['R', 'methods', 'httr', 'png', 'Biostrings'],
 498     'mpm': [],
 499     'genefilter': ['MatrixGenerics',
 500      'AnnotationDbi',
 501      'annotate',
 502      'Biobase',
 503      'graphics',
 504      'methods',
 505      'stats',
 506      'survival',
 507      'grDevices'],
 508     'MatrixGenerics': ['matrixStats', 'methods'],
 509     'matrixStats': [],
 510     'annotate': ['R',
 511      'AnnotationDbi',
 512      'XML',
 513      'Biobase',
 514      'DBI',
 515      'xtable',
 516      'graphics',
 517      'utils',
 518      'stats',
 519      'methods',
 520      'BiocGenerics',
 521      'httr'],
 522     'XML': [],
 523     'xtable': [],
 524     'httr': [],
 525     'survival': [],
 526     'grDevices': [],
 527     'limma': ['R', 'grDevices', 'graphics', 'stats', 'utils', 'methods'],
 528     'multtest': ['R',
 529      'methods',
 530      'BiocGenerics',
 531      'Biobase',
 532      'survival',
 533      'MASS',
 534      'stats4'],
 535     'MASS': [],
 536     'gplots': [],
 537     'a4Classif': ['a4Core',
 538      'a4Preproc',
 539      'methods',
 540      'Biobase',
 541      'ROCR',
 542      'pamr',
 543      'glmnet',
 544      'varSelRF',
 545      'utils',
 546      'graphics',
 547      'stats'],
 548     'ROCR': [],
 549     'pamr': [],
 550     'varSelRF': [],
 551     'a4Reporting': ['methods', 'xtable']}
 552
 553
 554
**Get transitive dependency network graph**
 556
 557
 558```python
 559commons_lang3_network = maven_pm_libio.get_transitive_network_graph("org.apache.commons:commons-lang3", generate=True)
 560commons_lang3_network
 561```
 562
 563
 564
 565
 566    <networkx.classes.digraph.DiGraph at 0x7f3c67ee3d90>
 567
 568
 569
 570
 571```python
 572# Draw the network
 573from matplotlib import patches
 574pos = nx.spring_layout(commons_lang3_network)
 575nx.draw(commons_lang3_network, pos, node_size=50, font_size=8)
 576
 577nx.draw_networkx_nodes(commons_lang3_network, pos, nodelist=["org.apache.commons:commons-lang3"], node_size=100, node_color="r")
 578plt.title("org.apache.commons:commons-lang3 transitive network", fontsize=15)
 579# add legend for red node
 580red_patch = patches.Patch(color='red', label='org.apache.commons:commons-lang3')
 581plt.legend(handles=[red_patch])
 582plt.show()
 583```
 584
 585
 586    
 587![png](Olivia%20Finder%20-%20Implementation_files/Olivia%20Finder%20-%20Implementation_229_0.png)
 588    
 589
 590
 591**Obtaining updated data**
 592
 593
 594```python
 595a4_network2 = bioconductor_pm_loaded.get_adjlist("a4")
 596a4_network2
 597```
 598
 599
 600
 601
 602    {'a4': ['a4Base', 'a4Preproc', 'a4Classif', 'a4Core', 'a4Reporting'],
 603     'a4Base': ['a4Preproc',
 604      'a4Core',
 605      'methods',
 606      'graphics',
 607      'grid',
 608      'Biobase',
 609      'annaffy',
 610      'mpm',
 611      'genefilter',
 612      'limma',
 613      'multtest',
 614      'glmnet',
 615      'gplots'],
 616     'a4Preproc': ['BiocGenerics', 'Biobase'],
 617     'BiocGenerics': ['R', 'methods', 'utils', 'graphics', 'stats'],
 618     'Biobase': ['R', 'BiocGenerics', 'utils', 'methods'],
 619     'a4Core': ['Biobase', 'glmnet', 'methods', 'stats'],
 620     'annaffy': ['R',
 621      'methods',
 622      'Biobase',
 623      'BiocManager',
 624      'GO.db',
 625      'AnnotationDbi',
 626      'DBI'],
 627     'AnnotationDbi': ['R',
 628      'methods',
 629      'utils',
 630      'stats4',
 631      'BiocGenerics',
 632      'Biobase',
 633      'IRanges',
 634      'DBI',
 635      'RSQLite',
 636      'S4Vectors',
 637      'stats',
 638      'KEGGREST'],
 639     'IRanges': ['R',
 640      'methods',
 641      'utils',
 642      'stats',
 643      'BiocGenerics',
 644      'S4Vectors',
 645      'stats4'],
 646     'S4Vectors': ['R', 'methods', 'utils', 'stats', 'stats4', 'BiocGenerics'],
 647     'KEGGREST': ['R', 'methods', 'httr', 'png', 'Biostrings'],
 648     'genefilter': ['MatrixGenerics',
 649      'AnnotationDbi',
 650      'annotate',
 651      'Biobase',
 652      'graphics',
 653      'methods',
 654      'stats',
 655      'survival',
 656      'grDevices'],
 657     'MatrixGenerics': ['matrixStats', 'methods'],
 658     'annotate': ['R',
 659      'AnnotationDbi',
 660      'XML',
 661      'Biobase',
 662      'DBI',
 663      'xtable',
 664      'graphics',
 665      'utils',
 666      'stats',
 667      'methods',
 668      'BiocGenerics',
 669      'httr'],
 670     'limma': ['R', 'grDevices', 'graphics', 'stats', 'utils', 'methods'],
 671     'multtest': ['R',
 672      'methods',
 673      'BiocGenerics',
 674      'Biobase',
 675      'survival',
 676      'MASS',
 677      'stats4'],
 678     'a4Classif': ['a4Core',
 679      'a4Preproc',
 680      'methods',
 681      'Biobase',
 682      'ROCR',
 683      'pamr',
 684      'glmnet',
 685      'varSelRF',
 686      'utils',
 687      'graphics',
 688      'stats'],
 689     'a4Reporting': ['methods', 'xtable']}
 690
 691
 692
Note that some package managers use dependencies that are not found in their own repositories, as is the case of the 'xtable' package, which, although not in Bioconductor, is a dependency of a Bioconductor package
 694
 695
 696```python
 697xtable_bioconductor = bioconductor_pm_scraper.fetch_package("xtable")
 698xtable_bioconductor
 699```
 700
Specifically, this package is in CRAN
 702
 703
 704```python
 705cran_pm = PackageManager(
 706    data_sources=[                  # List of data sources
 707        CranScraper(),
 708    ]
 709)
 710
 711cran_pm.fetch_package("xtable")
 712```
 713
 714
 715
 716
 717    <olivia_finder.package.Package at 0x7f3c2a19d090>
 718
 719
 720
To resolve this inconsistency, we can supply the package manager with the CRAN data source as an auxiliary data source, used to search for data that is not found in the main data source
 722
 723
 724```python
 725bioconductor_cran_pm = PackageManager(
 727    data_sources=[                                          # List of data sources
 728        BioconductorScraper(),
 729        CranScraper(),
 730    ]
 731)
 732
 733bioconductor_cran_pm.fetch_package("xtable")
 734```
 735
 736
 737
 738
 739    <olivia_finder.package.Package at 0x7f3c2a19c910>
 740
 741
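The fallback behaviour can be illustrated with plain dictionaries; a minimal sketch, not the actual `PackageManager` implementation (the data and the `fetch` helper are hypothetical):

```python
# Hypothetical per-repository dependency data
bioconductor_data = {"a4": ["a4Base", "a4Core"]}
cran_data = {"xtable": ["R", "stats", "utils"]}

def fetch(name, sources):
    # Try each data source in order; return the first hit
    for source in sources:
        if name in source:
            return source[name]
    return None

print(fetch("xtable", [bioconductor_data, cran_data]))  # → ['R', 'stats', 'utils']
```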
 742
In this way we can obtain the dependency network of a package recursively, now with access to packages and dependencies from the CRAN repository
 744
 745
 746```python
 747a4_network3 = bioconductor_cran_pm.get_adjlist("a4")
 748a4_network3
 749```
 750
 751
 752
 753
 754    {'a4': []}
 755
 756
 757
As you can see, we can get a more complete network when we combine data sources

The data sources must be compatible, as in the case of Bioconductor/CRAN
 761
 762
 763```python
 764a4_network.keys() == a4_network2.keys()
 765```
 766
 767
 768
 769
 770    False
 771
 772
 773
 774
 775```python
 776print(len(a4_network.keys()))
 777print(len(a4_network2.keys()))
 778print(len(a4_network3.keys()))
 779```
 780
 781    42
 782    18
 783    1
 784
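The key-count comparison above generalises to any pair of adjacency lists; a minimal sketch with hypothetical networks, showing which packages only appear when data sources are combined:

```python
# Hypothetical adjacency lists from a single and a combined data source
net_single = {"a4": ["a4Base"], "a4Base": []}
net_combined = {"a4": ["a4Base"], "a4Base": ["Biobase"], "Biobase": []}

# Packages reachable only with the combined source
only_in_combined = sorted(set(net_combined) - set(net_single))
print(only_in_combined)  # → ['Biobase']
```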
 785
 786## Export the data
 787
 788
 789```python
 790bioconductor_df = bioconductor_pm_loaded.export_dataframe(full_data=False)
 791
 792#Export the dataframe to a csv file
 793bioconductor_df.to_csv("aux_data/bioconductor_adjlist_scraping.csv", index=False)
 794bioconductor_df
 795```
 796
 797
 798
 799
 800<div>
 801<style scoped>
 802    .dataframe tbody tr th:only-of-type {
 803        vertical-align: middle;
 804    }
 805
 806    .dataframe tbody tr th {
 807        vertical-align: top;
 808    }
 809
 810    .dataframe thead th {
 811        text-align: right;
 812    }
 813</style>
 814<table border="1" class="dataframe">
 815  <thead>
 816    <tr style="text-align: right;">
 817      <th></th>
 818      <th>name</th>
 819      <th>dependency</th>
 820    </tr>
 821  </thead>
 822  <tbody>
 823    <tr>
 824      <th>0</th>
 825      <td>ABSSeq</td>
 826      <td>R</td>
 827    </tr>
 828    <tr>
 829      <th>1</th>
 830      <td>ABSSeq</td>
 831      <td>methods</td>
 832    </tr>
 833    <tr>
 834      <th>2</th>
 835      <td>ABSSeq</td>
 836      <td>locfit</td>
 837    </tr>
 838    <tr>
 839      <th>3</th>
 840      <td>ABSSeq</td>
 841      <td>limma</td>
 842    </tr>
 843    <tr>
 844      <th>4</th>
 845      <td>AMOUNTAIN</td>
 846      <td>R</td>
 847    </tr>
 848    <tr>
 849      <th>...</th>
 850      <td>...</td>
 851      <td>...</td>
 852    </tr>
 853    <tr>
 854      <th>28322</th>
 855      <td>zenith</td>
 856      <td>reshape2</td>
 857    </tr>
 858    <tr>
 859      <th>28323</th>
 860      <td>zenith</td>
 861      <td>progress</td>
 862    </tr>
 863    <tr>
 864      <th>28324</th>
 865      <td>zenith</td>
 866      <td>utils</td>
 867    </tr>
 868    <tr>
 869      <th>28325</th>
 870      <td>zenith</td>
 871      <td>Rdpack</td>
 872    </tr>
 873    <tr>
 874      <th>28326</th>
 875      <td>zenith</td>
 876      <td>stats</td>
 877    </tr>
 878  </tbody>
 879</table>
 880<p>28327 rows × 2 columns</p>
 881</div>
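The exported adjacency CSV can be read back and turned into an adjacency list with pandas; a minimal sketch (the two-column layout matches the export above; the sample rows and inline CSV stand-in are hypothetical):

```python
from io import StringIO

import pandas as pd

# Stand-in for a file such as the exported bioconductor adjacency list CSV
csv_text = "name,dependency\nABSSeq,R\nABSSeq,methods\nAMOUNTAIN,R\n"
df = pd.read_csv(StringIO(csv_text))

# Group dependencies per package to rebuild the adjacency list
adjlist = df.groupby("name")["dependency"].apply(list).to_dict()
print(adjlist)  # → {'ABSSeq': ['R', 'methods'], 'AMOUNTAIN': ['R']}
```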
 882
 883
 884
 885
 886```python
 887pypi_df = pypi_pm_loaded.export_dataframe(full_data=True)
 888pypi_df
 889```
 890
 891
 892
 893
 894<div>
 895<style scoped>
 896    .dataframe tbody tr th:only-of-type {
 897        vertical-align: middle;
 898    }
 899
 900    .dataframe tbody tr th {
 901        vertical-align: top;
 902    }
 903
 904    .dataframe thead th {
 905        text-align: right;
 906    }
 907</style>
 908<table border="1" class="dataframe">
 909  <thead>
 910    <tr style="text-align: right;">
 911      <th></th>
 912      <th>name</th>
 913      <th>version</th>
 914      <th>url</th>
 915      <th>dependency</th>
 916      <th>dependency_version</th>
 917      <th>dependency_url</th>
 918    </tr>
 919  </thead>
 920  <tbody>
 921    <tr>
 922      <th>0</th>
 923      <td>0x-sra-client</td>
 924      <td>4.0.0</td>
 925      <td>https://pypi.org/project/0x-sra-client/</td>
 926      <td>urllib3</td>
 927      <td>2.0.2</td>
 928      <td>https://pypi.org/project/urllib3/</td>
 929    </tr>
 930    <tr>
 931      <th>1</th>
 932      <td>0x-sra-client</td>
 933      <td>4.0.0</td>
 934      <td>https://pypi.org/project/0x-sra-client/</td>
 935      <td>six</td>
 936      <td>1.16.0</td>
 937      <td>https://pypi.org/project/six/</td>
 938    </tr>
 939    <tr>
 940      <th>2</th>
 941      <td>0x-sra-client</td>
 942      <td>4.0.0</td>
 943      <td>https://pypi.org/project/0x-sra-client/</td>
 944      <td>certifi</td>
 945      <td>2022.12.7</td>
 946      <td>https://pypi.org/project/certifi/</td>
 947    </tr>
 948    <tr>
 949      <th>3</th>
 950      <td>0x-sra-client</td>
 951      <td>4.0.0</td>
 952      <td>https://pypi.org/project/0x-sra-client/</td>
 953      <td>python</td>
 954      <td>None</td>
 955      <td>None</td>
 956    </tr>
 957    <tr>
 958      <th>4</th>
 959      <td>0x-sra-client</td>
 960      <td>4.0.0</td>
 961      <td>https://pypi.org/project/0x-sra-client/</td>
 962      <td>0x</td>
 963      <td>0.1</td>
 964      <td>https://pypi.org/project/0x/</td>
 965    </tr>
 966    <tr>
 967      <th>...</th>
 968      <td>...</td>
 969      <td>...</td>
 970      <td>...</td>
 971      <td>...</td>
 972      <td>...</td>
 973      <td>...</td>
 974    </tr>
 975    <tr>
 976      <th>933950</th>
 977      <td>zyfra-check</td>
 978      <td>0.0.9</td>
 979      <td>https://pypi.org/project/zyfra-check/</td>
 980      <td>pytest</td>
 981      <td>7.3.1</td>
 982      <td>https://pypi.org/project/pytest/</td>
 983    </tr>
 984    <tr>
 985      <th>933951</th>
 986      <td>zyfra-check</td>
 987      <td>0.0.9</td>
 988      <td>https://pypi.org/project/zyfra-check/</td>
 989      <td>jira</td>
 990      <td>3.5.0</td>
 991      <td>https://pypi.org/project/jira/</td>
 992    </tr>
 993    <tr>
 994      <th>933952</th>
 995      <td>zyfra-check</td>
 996      <td>0.0.9</td>
 997      <td>https://pypi.org/project/zyfra-check/</td>
 998      <td>testit</td>
 999      <td>None</td>
1000      <td>None</td>
1001    </tr>
1002    <tr>
1003      <th>933953</th>
1004      <td>zython</td>
1005      <td>0.4.1</td>
1006      <td>https://pypi.org/project/zython/</td>
1007      <td>wheel</td>
1008      <td>0.40.0</td>
1009      <td>https://pypi.org/project/wheel/</td>
1010    </tr>
1011    <tr>
1012      <th>933954</th>
1013      <td>zython</td>
1014      <td>0.4.1</td>
1015      <td>https://pypi.org/project/zython/</td>
1016      <td>minizinc</td>
1017      <td>0.9.0</td>
1018      <td>https://pypi.org/project/minizinc/</td>
1019    </tr>
1020  </tbody>
1021</table>
1022<p>933955 rows × 6 columns</p>
1023</div>
1024
1025
1026
1027
1028```python
1029npm_df = npm_pm_loaded.export_dataframe(full_data=True)
1030npm_df
1031```
1032
1033
1034
1035
1036<div>
1037<style scoped>
1038    .dataframe tbody tr th:only-of-type {
1039        vertical-align: middle;
1040    }
1041
1042    .dataframe tbody tr th {
1043        vertical-align: top;
1044    }
1045
1046    .dataframe thead th {
1047        text-align: right;
1048    }
1049</style>
1050<table border="1" class="dataframe">
1051  <thead>
1052    <tr style="text-align: right;">
1053      <th></th>
1054      <th>name</th>
1055      <th>version</th>
1056      <th>url</th>
1057      <th>dependency</th>
1058      <th>dependency_version</th>
1059      <th>dependency_url</th>
1060    </tr>
1061  </thead>
1062  <tbody>
1063    <tr>
1064      <th>0</th>
1065      <td>--hoodmane-test-pyodide</td>
1066      <td>0.21.0</td>
1067      <td>https://www.npmjs.com/package/--hoodmane-test-...</td>
1068      <td>base-64</td>
1069      <td>1.0.0</td>
1070      <td>https://www.npmjs.com/package/base-64</td>
1071    </tr>
1072    <tr>
1073      <th>1</th>
1074      <td>--hoodmane-test-pyodide</td>
1075      <td>0.21.0</td>
1076      <td>https://www.npmjs.com/package/--hoodmane-test-...</td>
1077      <td>node-fetch</td>
1078      <td>3.3.1</td>
1079      <td>https://www.npmjs.com/package/node-fetch</td>
1080    </tr>
1081    <tr>
1082      <th>2</th>
1083      <td>--hoodmane-test-pyodide</td>
1084      <td>0.21.0</td>
1085      <td>https://www.npmjs.com/package/--hoodmane-test-...</td>
1086      <td>ws</td>
1087      <td>8.13.0</td>
1088      <td>https://www.npmjs.com/package/ws</td>
1089    </tr>
1090    <tr>
1091      <th>3</th>
1092      <td>-lidonghui</td>
1093      <td>1.0.0</td>
1094      <td>https://www.npmjs.com/package/-lidonghui</td>
1095      <td>axios</td>
1096      <td>1.4.0</td>
1097      <td>https://www.npmjs.com/package/axios</td>
1098    </tr>
1099    <tr>
1100      <th>4</th>
1101      <td>-lidonghui</td>
1102      <td>1.0.0</td>
1103      <td>https://www.npmjs.com/package/-lidonghui</td>
1104      <td>commander</td>
1105      <td>10.0.1</td>
1106      <td>https://www.npmjs.com/package/commander</td>
1107    </tr>
1108    <tr>
1109      <th>...</th>
1110      <td>...</td>
1111      <td>...</td>
1112      <td>...</td>
1113      <td>...</td>
1114      <td>...</td>
1115      <td>...</td>
1116    </tr>
1117    <tr>
1118      <th>4855089</th>
1119      <td>zzzzz-first-module</td>
1120      <td>1.0.0</td>
1121      <td>https://www.npmjs.com/package/zzzzz-first-module</td>
1122      <td>rxjs</td>
1123      <td>7.8.1</td>
1124      <td>https://www.npmjs.com/package/rxjs</td>
1125    </tr>
1126    <tr>
1127      <th>4855090</th>
1128      <td>zzzzz-first-module</td>
1129      <td>1.0.0</td>
1130      <td>https://www.npmjs.com/package/zzzzz-first-module</td>
1131      <td>zone.js</td>
1132      <td>0.13.0</td>
1133      <td>https://www.npmjs.com/package/zone.js</td>
1134    </tr>
1135    <tr>
1136      <th>4855091</th>
1137      <td>zzzzzwszzzz</td>
1138      <td>1.0.0</td>
1139      <td>https://www.npmjs.com/package/zzzzzwszzzz</td>
1140      <td>commander</td>
1141      <td>10.0.1</td>
1142      <td>https://www.npmjs.com/package/commander</td>
1143    </tr>
1144    <tr>
1145      <th>4855092</th>
1146      <td>zzzzzwszzzz</td>
1147      <td>1.0.0</td>
1148      <td>https://www.npmjs.com/package/zzzzzwszzzz</td>
1149      <td>inquirer</td>
1150      <td>9.2.2</td>
1151      <td>https://www.npmjs.com/package/inquirer</td>
1152    </tr>
1153    <tr>
1154      <th>4855093</th>
1155      <td>zzzzzwszzzz</td>
1156      <td>1.0.0</td>
1157      <td>https://www.npmjs.com/package/zzzzzwszzzz</td>
1158      <td>link</td>
1159      <td>1.5.1</td>
1160      <td>https://www.npmjs.com/package/link</td>
1161    </tr>
1162  </tbody>
1163</table>
1164<p>4855094 rows × 6 columns</p>
1165</div>
1166
1167
1168
1169**Get Network graph**
1170
1171
1172```python
1173bioconductor_G = bioconductor_pm_loaded.get_network_graph()
1174bioconductor_G
1175```
1176
1177
1178
1179
1180    <networkx.classes.digraph.DiGraph at 0x7f3c229451b0>
1181
1182
1183
1184
1185```python
1186# Draw the graph
1187# ----------------
1188# Note:
1189#   - Execution time can take a bit
1190
1191pos = nx.spring_layout(bioconductor_G)
1192plt.figure(figsize=(10, 10))
1193nx.draw_networkx_nodes(bioconductor_G, pos, node_size=10, node_color="blue")
1194nx.draw_networkx_edges(bioconductor_G, pos, alpha=0.4, edge_color="black", width=0.1)
1195plt.title("Bioconductor network graph", fontsize=15)
1196plt.show()
1197```
1198  
1199
1200
1201## Explore the data 
1202
1203
As explained above, we can appreciate the difference when we use a combined data source
1205
1206
1207```python
1208bioconductor_cran_pm = PackageManager(
1209    data_sources=[BioconductorScraper(), CranScraper()]
1210)
1211
1212a4_network_2 = bioconductor_cran_pm.fetch_adjlist("a4")
1213```
1214
1215
1216```python
1217import json
1218print(json.dumps(a4_network_2, indent=4))
1219```
1220
1221    {
1222        "a4": [
1223            "a4Base",
1224            "a4Preproc",
1225            "a4Classif",
1226            "a4Core",
1227            "a4Reporting"
1228        ],
1229        "a4Base": [
1230            "a4Preproc",
1231            "a4Core",
1232            "methods",
1233            "graphics",
1234            "grid",
1235            "Biobase",
1236            "annaffy",
1237            "mpm",
1238            "genefilter",
1239            "limma",
1240            "multtest",
1241            "glmnet",
1242            "gplots"
1243        ],
1244        "a4Preproc": [
1245            "BiocGenerics",
1246            "Biobase"
1247        ],
1248        "BiocGenerics": [
1249            "R",
1250            "methods",
1251            "utils",
1252            "graphics",
1253            "stats"
1254        ],
1255        "R": [],
1256        "methods": [],
1257        "utils": [],
1258        "graphics": [],
1259        "stats": [],
1260        "Biobase": [
1261            "R",
1262            "BiocGenerics",
1263            "utils",
1264            "methods"
1265        ],
1266        "a4Core": [
1267            "Biobase",
1268            "glmnet",
1269            "methods",
1270            "stats"
1271        ],
1272        "glmnet": [
1273            "R",
1274            "Matrix",
1275            "methods",
1276            "utils",
1277            "foreach",
1278            "shape",
1279            "survival",
1280            "Rcpp"
1281        ],
1282        "Matrix": [
1283            "R",
1284            "methods",
1285            "graphics",
1286            "grid",
1287            "lattice",
1288            "stats",
1289            "utils"
1290        ],
1291        "foreach": [
1292            "R",
1293            "codetools",
1294            "utils",
1295            "iterators"
1296        ],
1297        "shape": [
1298            "R",
1299            "stats",
1300            "graphics",
1301            "grDevices"
1302        ],
1303        "survival": [
1304            "R",
1305            "graphics",
1306            "Matrix",
1307            "methods",
1308            "splines",
1309            "stats",
1310            "utils"
1311        ],
1312        "Rcpp": [
1313            "methods",
1314            "utils"
1315        ],
1316        "grid": [],
1317        "annaffy": [
1318            "R",
1319            "methods",
1320            "Biobase",
1321            "BiocManager",
1322            "GO.db",
1323            "AnnotationDbi",
1324            "DBI"
1325        ],
1326        "BiocManager": [
1327            "utils"
1328        ],
1329        "GO.db": [],
1330        "AnnotationDbi": [
1331            "R",
1332            "methods",
1333            "stats4",
1334            "BiocGenerics",
1335            "Biobase",
1336            "IRanges",
1337            "DBI",
1338            "RSQLite",
1339            "S4Vectors",
1340            "stats",
1341            "KEGGREST"
1342        ],
1343        "stats4": [],
1344        "IRanges": [
1345            "R",
1346            "methods",
1347            "utils",
1348            "stats",
1349            "BiocGenerics",
1350            "S4Vectors",
1351            "stats4"
1352        ],
1353        "DBI": [
1354            "methods",
1355            "R"
1356        ],
1357        "RSQLite": [
1358            "R",
1359            "bit64",
1360            "blob",
1361            "DBI",
1362            "memoise",
1363            "methods",
1364            "pkgconfig"
1365        ],
1366        "S4Vectors": [
1367            "R",
1368            "methods",
1369            "utils",
1370            "stats",
1371            "stats4",
1372            "BiocGenerics"
1373        ],
1374        "KEGGREST": [
1375            "R",
1376            "methods",
1377            "httr",
1378            "png",
1379            "Biostrings"
1380        ],
1381        "mpm": [
1382            "R",
1383            "MASS",
1384            "KernSmooth"
1385        ],
1386        "MASS": [
1387            "R",
1388            "grDevices",
1389            "graphics",
1390            "stats",
1391            "utils",
1392            "methods"
1393        ],
1394        "grDevices": [],
1395        "KernSmooth": [
1396            "R",
1397            "stats"
1398        ],
1399        "genefilter": [
1400            "MatrixGenerics",
1401            "AnnotationDbi",
1402            "annotate",
1403            "Biobase",
1404            "graphics",
1405            "methods",
1406            "stats",
1407            "survival",
1408            "grDevices"
1409        ],
1410        "MatrixGenerics": [
1411            "matrixStats",
1412            "methods"
1413        ],
1414        "matrixStats": [
1415            "R"
1416        ],
1417        "annotate": [
1418            "R",
1419            "AnnotationDbi",
1420            "XML",
1421            "Biobase",
1422            "DBI",
1423            "xtable",
1424            "graphics",
1425            "utils",
1426            "stats",
1427            "methods",
1428            "BiocGenerics",
1429            "httr"
1430        ],
1431        "XML": [
1432            "R",
1433            "methods",
1434            "utils"
1435        ],
1436        "xtable": [
1437            "R",
1438            "stats",
1439            "utils"
1440        ],
1441        "httr": [
1442            "R",
1443            "curl",
1444            "jsonlite",
1445            "mime",
1446            "openssl",
1447            "R6"
1448        ],
1449        "limma": [
1450            "R",
1451            "grDevices",
1452            "graphics",
1453            "stats",
1454            "utils",
1455            "methods"
1456        ],
1457        "multtest": [
1458            "R",
1459            "methods",
1460            "BiocGenerics",
1461            "Biobase",
1462            "survival",
1463            "MASS",
1464            "stats4"
1465        ],
1466        "gplots": [
1467            "R",
1468            "gtools",
1469            "stats",
1470            "caTools",
1471            "KernSmooth",
1472            "methods"
1473        ],
1474        "gtools": [
1475            "methods",
1476            "stats",
1477            "utils"
1478        ],
1479        "caTools": [
1480            "R",
1481            "bitops"
1482        ],
1483        "bitops": [],
1484        "a4Classif": [
1485            "a4Core",
1486            "a4Preproc",
1487            "methods",
1488            "Biobase",
1489            "ROCR",
1490            "pamr",
1491            "glmnet",
1492            "varSelRF",
1493            "utils",
1494            "graphics",
1495            "stats"
1496        ],
1497        "ROCR": [
1498            "R",
1499            "methods",
1500            "graphics",
1501            "grDevices",
1502            "gplots",
1503            "stats"
1504        ],
1505        "pamr": [
1506            "R",
1507            "cluster",
1508            "survival"
1509        ],
1510        "cluster": [
1511            "R",
1512            "graphics",
1513            "grDevices",
1514            "stats",
1515            "utils"
1516        ],
1517        "varSelRF": [
1518            "R",
1519            "randomForest",
1520            "parallel"
1521        ],
1522        "randomForest": [
1523            "R",
1524            "stats"
1525        ],
1526        "parallel": [],
1527        "a4Reporting": [
1528            "methods",
1529            "xtable"
1530        ]
1531    }
1532
1533
1534'''
1535
from __future__ import annotations
from typing import Dict, List, Optional, Union
import pickle
import tqdm
import pandas as pd
import networkx as nx

from .utilities.config import Configuration
from .myrequests.request_handler import RequestHandler
from .utilities.logger import MyLogger
from .data_source.data_source import DataSource
from .data_source.scraper_ds import ScraperDataSource
from .data_source.csv_ds import CSVDataSource
from .data_source.librariesio_ds import LibrariesioDataSource
from .data_source.repository_scrapers.github import GithubScraper
from .package import Package


class PackageManager():
    '''
    Class that represents a package manager, which provides a way to obtain packages from a data source and store them
    in a dictionary
    '''

    def __init__(self, data_sources: Optional[List[DataSource]] = None):
        '''
        Constructor of the PackageManager class

        Parameters
        ----------
        data_sources : Optional[List[DataSource]]
            List of data sources used to obtain the packages

        Raises
        ------
        ValueError
            If the data_sources parameter is None or empty

        Examples
        --------
        >>> package_manager = PackageManager([CSVDataSource("path/to/file.csv", dependent_field="name", dependency_field="dependency")])
        '''

        if not data_sources:
            raise ValueError("Data sources cannot be empty")

        self.data_sources: List[DataSource] = data_sources
        self.packages: Dict[str, Package] = {}

        # Init the logger for the package manager
        self.logger = MyLogger.get_logger('logger_packagemanager')


    def save(self, path: str):
        '''
        Saves the package manager to a file, normally with the extension .olvpm for easy identification
        as an Olivia package manager file

        Parameters
        ----------
        path : str
            Path of the file to save the package manager
        '''

        # Remove redundant objects before pickling
        for data_source in self.data_sources:
            if isinstance(data_source, ScraperDataSource):
                try:
                    del data_source.request_handler
                except AttributeError:
                    pass

        try:
            # Use pickle to save the package manager
            with open(path, "wb") as f:
                pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            raise PackageManagerSaveError(f"Error saving package manager: {e}") from e
    @classmethod
    def load_from_persistence(cls, path: str):
        '''
        Load the package manager from a file created with the save method
        Normally, it has the extension .olvpm

        Parameters
        ----------
        path : str
            Path of the file to load the package manager

        Returns
        -------
        Union[PackageManager, None]
            PackageManager object if the file exists and is valid, None otherwise
        '''

        # Init the logger for the package manager
        logger = MyLogger.get_logger("logger_packagemanager")

        # Try to load the package manager from the file
        try:
            # Use pickle to load the package manager
            logger.info(f"Loading package manager from {path}")
            with open(path, "rb") as f:
                obj = pickle.load(f)
                logger.info("Package manager loaded")
        except Exception as e:
            logger.error(f"Error loading package manager from {path}: {e}")
            return None

        if not isinstance(obj, PackageManager):
            return None

        # Restore the request handler for the scraper data sources (removed on save)
        for data_source in obj.data_sources:
            if isinstance(data_source, ScraperDataSource):
                data_source.request_handler = RequestHandler()
                # Set the logger for the scraper data source
                data_source.logger = MyLogger.get_logger("logger_datasource")

        obj.logger = logger

        return obj

    @classmethod
    def load_from_csv(
        cls,
        csv_path: str,
        dependent_field: Optional[str] = None,
        dependency_field: Optional[str] = None,
        version_field: Optional[str] = None,
        dependency_version_field: Optional[str] = None,
        url_field: Optional[str] = None,
        default_format: bool = False,
    ) -> PackageManager:
        '''
        Load a csv file into a PackageManager object

        Parameters
        ----------
        csv_path : str
            Path of the csv file to load
        dependent_field : Optional[str], optional
            Name of the dependent field, by default None
        dependency_field : Optional[str], optional
            Name of the dependency field, by default None
        version_field : Optional[str], optional
            Name of the version field, by default None
        dependency_version_field : Optional[str], optional
            Name of the dependency version field, by default None
        url_field : Optional[str], optional
            Name of the url field, by default None
        default_format : bool, optional
            If True, the csv has the structure of full_adjlist.csv, by default False

        Examples
        --------
        >>> pm = PackageManager.load_from_csv(
            "full_adjlist.csv",
            dependent_field="dependent",
            dependency_field="dependency",
            version_field="version",
            dependency_version_field="dependency_version",
            url_field="url"
        )
        >>> pm = PackageManager.load_from_csv("full_adjlist.csv", default_format=True)

        '''

        # Init the logger for the package manager
        logger = MyLogger.get_logger('logger_packagemanager')

        try:
            logger.info(f"Loading csv file from {csv_path}")
            data = pd.read_csv(csv_path)
        except Exception as e:
            logger.error(f"Error loading csv file: {e}")
            raise PackageManagerLoadError(f"Error loading csv file: {e}") from e

        csv_fields = []

        if default_format:
            # If the csv has the structure of full_adjlist.csv, we use the default fields
            dependent_field = 'name'
            dependency_field = 'dependency'
            version_field = 'version'
            dependency_version_field = 'dependency_version'
            url_field = 'url'
            csv_fields = [dependent_field, dependency_field,
                          version_field, dependency_version_field, url_field]
        else:
            if dependent_field is None or dependency_field is None:
                raise PackageManagerLoadError(
                    "Dependent and dependency fields must be specified")

            csv_fields = [dependent_field, dependency_field]
            # If the optional fields are specified, we add them to the list
            if version_field is not None:
                csv_fields.append(version_field)
            if dependency_version_field is not None:
                csv_fields.append(dependency_version_field)
            if url_field is not None:
                csv_fields.append(url_field)

        # If the csv does not have the specified fields, we raise an error
        if any(col not in data.columns for col in csv_fields):
            logger.error("Invalid csv format")
            raise PackageManagerLoadError("Invalid csv format")

        # We create the data source
        data_source = CSVDataSource(
            file_path=csv_path,
            dependent_field=dependent_field,
            dependency_field=dependency_field,
            dependent_version_field=version_field,
            dependency_version_field=dependency_version_field,
            dependent_url_field=url_field
        )

        obj = cls([data_source])

        # Add the logger to the package manager
        obj.logger = logger

        # Return the package manager
        return obj

    def initialize(
        self,
        package_names: Optional[List[str]] = None,
        show_progress: bool = False,
        chunk_size: int = 10000):
        '''
        Initializes the package manager by loading the packages from the data source

        Parameters
        ----------
        package_names : Optional[List[str]]
            List of package names to load, if None, all the packages will be loaded
        show_progress : bool
            If True, a progress bar will be shown
        chunk_size : int
            Size of the chunks in which the packages are loaded, to avoid memory errors

        .. warning::

            For large package lists, this method can take a long time to complete
        '''

        # Get package names from the data sources if needed
        if package_names is None:
            for data_source in self.data_sources:
                try:
                    package_names = data_source.obtain_package_names()
                    break
                except NotImplementedError as e:
                    self.logger.debug(f"Data source {data_source} does not implement obtain_package_names method: {e}")
                    continue
                except Exception as e:
                    self.logger.error(f"Error while obtaining package names from data source: {e}")
                    continue

        # Check if the package names are valid
        if package_names is None or not isinstance(package_names, list):
            raise ValueError("No valid package names found")

        # Instantiate the progress bar if needed
        progress_bar = tqdm.tqdm(
            total=len(package_names),
            colour="green",
            desc="Loading packages",
            unit="packages",
        ) if show_progress else None

        # Split the package names into chunks to avoid memory errors
        package_names_chunked = [package_names[i:i + chunk_size] for i in range(0, len(package_names), chunk_size)]

        for chunk in package_names_chunked:
            # Obtain the packages data from the data source and store them
            self.fetch_packages(
                package_names=chunk,
                progress_bar=progress_bar,
                extend=True
            )

        # Close the progress bar if needed
        if progress_bar is not None:
            progress_bar.close()

    def fetch_package(self, package_name: str) -> Union[Package, None]:
        '''
        Builds a Package object using the data sources in order until one of them returns a valid package

        Parameters
        ----------
        package_name : str
            Name of the package

        Returns
        -------
        Union[Package, None]
            Package object if the package exists, None otherwise

        Examples
        --------
        >>> package = package_manager.fetch_package("package_name")
        >>> package
        <Package: package_name>
        '''
        # Obtain the package data from the data sources in order
        package_data = None
        for data_source in self.data_sources:

            if isinstance(data_source, (GithubScraper, CSVDataSource, ScraperDataSource, LibrariesioDataSource)):
                package_data = data_source.obtain_package_data(package_name)
            else:
                # Fall back to the in-memory package list
                package = self.get_package(package_name)
                package_data = package.to_dict() if package is not None else None

            if package_data is not None:
                self.logger.debug(f"Package {package_name} found using {data_source.__class__.__name__}")
                break
            else:
                self.logger.debug(f"Package {package_name} not found using {data_source.__class__.__name__}")

        # Return the package if it exists
        return None if package_data is None else Package.load(package_data)

    def fetch_packages(
        self,
        package_names: List[str],
        progress_bar: Optional[tqdm.tqdm] = None,
        extend: bool = False
    ) -> List[Package]:
        '''
        Builds a list of Package objects using the data sources in order until one of them returns a valid package

        Parameters
        ----------
        package_names : List[str]
            List of package names
        progress_bar : Optional[tqdm.tqdm]
            Progress bar to show the progress of the operation
        extend : bool
            If True, the packages will be added to the existing ones, otherwise, the existing ones will be replaced

        Returns
        -------
        List[Package]
            List of Package objects

        Examples
        --------
        >>> packages = package_manager.fetch_packages(["package_name_1", "package_name_2"])
        >>> packages
        [<Package: package_name_1>, <Package: package_name_2>]
        '''

        # Check if the package names are valid
        if not isinstance(package_names, list):
            raise ValueError("Package names must be a list")

        preferred_data_source = self.data_sources[0]

        # Return list
        packages = []

        # If the preferred data source is a ScraperDataSource, use its obtain_packages_data method for parallelization
        if isinstance(preferred_data_source, ScraperDataSource):

            data_found, not_found = preferred_data_source.obtain_packages_data(
                package_names=package_names,
                progress_bar=progress_bar # type: ignore
            )
            self.logger.info(f"Packages found: {len(data_found)}, Packages not found: {len(not_found)}")
            packages = [Package.load(package_data) for package_data in data_found]

        # Otherwise, process the packages sequentially through the data sources of the list
        else:
            for package_name in package_names:
                package = self.fetch_package(package_name)
                if package is not None:
                    packages.append(package)

                if progress_bar is not None:
                    progress_bar.update(1)

        self.logger.info(f"Total packages found: {len(packages)}")

        # Update the self.packages attribute, overwriting the packages with the same name
        # but conserving the other packages
        if extend:
            self.logger.info("Extending data source with obtained packages")
            for package in packages:
                self.packages[package.name] = package

        return packages

    def get_package(self, package_name: str) -> Union[Package, None]:
        '''
        Obtain a package from the package manager

        Parameters
        ----------
        package_name : str
            Name of the package

        Returns
        -------
        Union[Package, None]
            Package object if the package exists, None otherwise

        Examples
        --------
        >>> package = package_manager.get_package("package_name")
        >>> print(package.name)
        '''
        return self.packages.get(package_name, None)

    def get_packages(self) -> List[Package]:
        '''
        Obtain the list of packages of the package manager

        Returns
        -------
        List[Package]
            List of packages of the package manager

        Examples
        --------
        >>> package_list = package_manager.get_packages()
        '''
        return list(self.packages.values())

    def package_names(self) -> List[str]:
        '''
        Obtain the list of package names of the package manager

        Returns
        -------
        List[str]
            List of package names of the package manager

        Examples
        --------
        >>> package_names = package_manager.package_names()
        '''
        return list(self.packages.keys())

    def fetch_package_names(self) -> List[str]:
        '''
        Obtain the list of package names from the preferred (first) data source

        Returns
        -------
        List[str]
            List of package names

        Examples
        --------
        >>> package_names = package_manager.fetch_package_names()
        '''

        return self.data_sources[0].obtain_package_names()

    def export_dataframe(self, full_data=False) -> pd.DataFrame:
        '''
        Convert the object to an adjacency list, where each row represents a dependency
        Packages without dependencies produce no rows

        Parameters
        ----------
        full_data : bool, optional
            If True, the adjacency list will contain the version and url of the packages, by default False

        Returns
        -------
        pd.DataFrame
            Dependency network as an adjacency list

        Examples
        --------
        >>> adj_list = package_manager.export_dataframe()
        >>> print(adj_list)
            [name, dependency]
        '''

        if not self.packages:
            self.logger.debug("The package manager is empty")
            return pd.DataFrame()

        rows = []

        if full_data:
            for package_name in self.packages.keys():
                package = self.get_package(package_name)

                for dependency in package.dependencies:
                    try:
                        dependency_full = self.get_package(dependency.name)
                        rows.append(
                            [package.name, package.version, package.url, dependency_full.name, dependency_full.version, dependency_full.url]
                        )
                    except Exception:
                        # The dependency is not in the package manager; export it without version and url
                        if dependency.name is not None:
                            rows.append(
                                [package.name, package.version, package.url, dependency.name, None, None]
                            )

            return pd.DataFrame(rows, columns=['name', 'version', 'url', 'dependency', 'dependency_version', 'dependency_url'])
        else:
            for package_name in self.packages.keys():
                package = self.get_package(package_name)
                rows.extend(
                    [package.name, dependency.name]
                    for dependency in package.dependencies
                )
            return pd.DataFrame(rows, columns=['name', 'dependency'])

    def get_adjlist(self, package_name: str, adjlist: Optional[Dict] = None, deep_level: int = 5) -> Dict[str, List[str]]:
        """
        Generates the dependency network of a package from the in-memory package list.

        Parameters
        ----------
        package_name : str
            The name of the package to generate the dependency network
        adjlist : Optional[Dict], optional
            The dependency network accumulated so far, by default None
        deep_level : int, optional
            The depth level of the dependency network, by default 5

        Returns
        -------
        Dict[str, List[str]]
            The dependency network of the package
        """

        # If the dependency network is not specified, we create it (Initial case)
        if adjlist is None:
            adjlist = {}

        # If the depth level is 0, or the package is already in the dependency network,
        # we return the dependency network (Stop conditions)
        if deep_level == 0 or package_name in adjlist:
            return adjlist

        # Use the data of the package manager
        current_package = self.get_package(package_name)
        dependencies = current_package.get_dependencies_names() if current_package is not None else []

        # Add the package and its dependencies to the dependency network
        adjlist[package_name] = dependencies

        # Append the dependencies of the package to the dependency network
        for dependency_name in dependencies:

            if (dependency_name not in adjlist) and (self.get_package(dependency_name) is not None):

                adjlist = self.get_adjlist(
                    package_name=dependency_name,
                    adjlist=adjlist,
                    deep_level=deep_level - 1,
                )

        return adjlist

    def fetch_adjlist(self, package_name: str, deep_level: int = 5, adjlist: Optional[Dict[str, List[str]]] = None) -> Dict[str, List[str]]:
        """
        Generates the dependency network of a package from the data source.

        Parameters
        ----------
        package_name : str
            The name of the package to generate the dependency network
        deep_level : int, optional
            The depth level of the dependency network, by default 5
        adjlist : Optional[Dict[str, List[str]]], optional
            The dependency network accumulated so far, by default None

        Returns
        -------
        Dict[str, List[str]]
            The dependency network of the package
        """

        if adjlist is None:
            adjlist = {}

        # If the depth level is 0, or the package is already in the adjacency list,
        # we return the adjacency list (Stop conditions)
        if deep_level == 0 or package_name in adjlist:
            return adjlist

        dependencies = []
        try:
            current_package = self.fetch_package(package_name)
            dependencies = current_package.get_dependencies_names()
        except Exception as e:
            self.logger.debug(f"Package {package_name} not found: {e}")

        # Add the package to the adjacency list if it is not already in it
        adjlist[package_name] = dependencies

        # Append the dependencies of the package to the adjacency list if they are not already in it
        for dependency_name in dependencies:
            if dependency_name not in adjlist:
                try:
                    adjlist = self.fetch_adjlist(
                        package_name=dependency_name,                # The name of the dependency
                        deep_level=deep_level - 1,                   # The depth level is reduced by 1
                        adjlist=adjlist                              # The global adjacency list
                    )
                except Exception:
                    self.logger.debug(
                        f"The package {dependency_name}, as dependency of {package_name}, does not exist in the data source"
                    )

        return adjlist

    def __add_chunk(self,
        df, G,
        filter_field=None,
        filter_value=None
    ):

        filtered = df[df[filter_field] == filter_value] if filter_field else df
        links = list(zip(filtered["name"], filtered["dependency"]))
        G.add_edges_from(links)
        return G

    def get_network_graph(
            self, chunk_size=int(1e6),
            source_field="dependency", target_field="name",
            filter_field=None, filter_value=None) -> nx.DiGraph:
        """
        Builds a dependency network graph from a dataframe of dependencies.
        The dataframe must have two columns: dependent and dependency.

        Parameters
        ----------
        chunk_size : int
            Number of rows to process at a time
        source_field : str
            Name of the column containing the source node
        target_field : str
            Name of the column containing the target node
        filter_field : str, optional
            Name of the column to filter on, by default None
        filter_value : str, optional
            Value to filter on, by default None

        Returns
        -------
        nx.DiGraph
            Directed graph of dependencies
        """

        # If the default data source is a CSVDataSource, we build the graph directly from its data
        default_datasource = self.__get_default_datasource()
        if isinstance(default_datasource, CSVDataSource):
            return nx.from_pandas_edgelist(
                default_datasource.data, source=source_field,
                target=target_field, create_using=nx.DiGraph()
            )

        # Otherwise, we build the graph from the exported dataframe
        df = self.export_dataframe()
        try:
            # New NetworkX directed graph
            G = nx.DiGraph()

            for i in range(0, len(df), chunk_size):
                chunk = df.iloc[i:i+chunk_size]
                # Add dependencies from chunk to G
                G = self.__add_chunk(
                    chunk,
                    G,
                    filter_field=filter_field,
                    filter_value=filter_value
                )

            return G

        except Exception as e:
            self.logger.error(f"Error building the network graph: {e}")

    def get_transitive_network_graph(self, package_name: str, deep_level: int = 5, generate=False) -> nx.DiGraph:
        """
        Gets the transitive dependency network of a package as a NetworkX graph.

        Parameters
        ----------
        package_name : str
            The name of the package to get the dependency network
        deep_level : int, optional
            The depth level of the dependency network, by default 5
        generate : bool, optional
            If True, the dependency network is generated from the data source, by default False

        Returns
        -------
        nx.DiGraph
            The dependency network of the package
        """

        if generate:
            # Get the dependency network from the data source
            dependency_network = self.fetch_adjlist(package_name=package_name, deep_level=deep_level, adjlist={})
        else:
            # Get the dependency network from in-memory data
            dependency_network = self.get_adjlist(package_name=package_name, deep_level=deep_level)

        # Create a NetworkX graph of the dependency network as (DEPENDENCY ---> PACKAGE)
        G = nx.DiGraph()
        for name, dependencies in dependency_network.items():
            for dependency_name in dependencies:
                G.add_edge(dependency_name, name)

        return G

    def __get_default_datasource(self):
        """
        Gets the default data source

        Returns
        -------
        DataSource
            The default data source
        """

        return self.data_sources[0] if len(self.data_sources) > 0 else None

class PackageManagerLoadError(Exception):
    """
    Exception raised when an error occurs while loading a package manager

    Attributes
    ----------
    message : str
        Error message
    """

    def __init__(self, message):
        self.message = message
        super().__init__(self.message)

class PackageManagerSaveError(Exception):
    """
    Exception raised when an error occurs while saving a package manager

    Attributes
    ----------
    message : str
        Error message
    """

    def __init__(self, message):
        self.message = message
        super().__init__(self.message)
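The depth-limited recursion shared by `get_adjlist` and `fetch_adjlist` can be reduced to a small self-contained sketch. The `build_adjlist` helper and the sample `graph` below are hypothetical illustrations (not part of the module API), using package names from the Bioconductor example in the module docstring:

```python
def build_adjlist(name, lookup, adjlist=None, deep_level=5):
    """Depth-limited DFS over a name -> dependencies mapping.

    Mirrors the stop conditions of get_adjlist: recursion ends when the
    depth budget is exhausted or the node was already visited.
    """
    if adjlist is None:
        adjlist = {}
    # Stop conditions: depth exhausted or package already in the network
    if deep_level == 0 or name in adjlist:
        return adjlist
    dependencies = lookup.get(name, [])
    adjlist[name] = dependencies
    for dep in dependencies:
        # Only recurse into dependencies that exist in the lookup,
        # as get_adjlist does with self.get_package(...)
        if dep in lookup:
            build_adjlist(dep, lookup, adjlist, deep_level - 1)
    return adjlist


# Toy dependency data, loosely based on the Bioconductor example above
graph = {
    "a4Classif": ["ROCR", "pamr"],
    "ROCR": ["gplots"],
    "pamr": ["cluster"],
    "gplots": [],
    "cluster": [],
}
result = build_adjlist("a4Classif", graph)
# result maps each reachable package to its dependency list
```

Note that the accumulator dictionary is threaded through the recursive calls, so shared dependencies are visited only once even when several packages depend on them.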
class PackageManager:
1555class PackageManager():
1556    '''
1557    Class that represents a package manager, which provides a way to obtain packages from a data source and store them
1558    in a dictionary
1559    '''
1560
1561    def __init__(self, data_sources: Optional[List[DataSource]] = None):
1562        '''
1563        Constructor of the PackageManager class
1564
1565        Parameters
1566        ----------
1567
1568        data_sources : Optional[List[DataSource]]
1569            List of data sources to obtain the packages, if None, an empty list will be used
1570
1571        Raises
1572        ------
1573        ValueError
1574            If the data_sources parameter is None or empty
1575
1576        Examples
1577        --------
1578        >>> package_manager = PackageManager("My package manager", [CSVDataSource("csv_data_source", "path/to/file.csv")])
1579        '''
1580
        if not data_sources:
            raise ValueError("Data source cannot be empty")

        self.data_sources: List[DataSource] = data_sources
        self.packages: Dict[str, Package] = {}

        # Init the logger for the package manager
        self.logger = MyLogger.get_logger('logger_packagemanager')


    def save(self, path: str):
        '''
        Saves the package manager to a file; by convention it has the extension .olvpm for easy identification
        as an Olivia package manager file

        Parameters
        ----------
        path : str
            Path of the file to save the package manager
        '''

        # Remove unpicklable objects (the request handler is rebuilt on load)
        for data_source in self.data_sources:
            if isinstance(data_source, ScraperDataSource):
                try:
                    del data_source.request_handler
                except AttributeError:
                    pass

        try:
            # Use pickle to save the package manager
            with open(path, "wb") as f:
                pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            raise PackageManagerSaveError(f"Error saving package manager: {e}") from e

    @classmethod
    def load_from_persistence(cls, path: str):
        '''
        Load the package manager from a file created with the save method
        Normally, it has the extension .olvpm

        Parameters
        ----------
        path : str
            Path of the file to load the package manager

        Returns
        -------
        Union[PackageManager, None]
            PackageManager object if the file exists and is valid, None otherwise
        '''

        # Init the logger for the package manager
        logger = MyLogger.get_logger("logger_packagemanager")

        # Try to load the package manager from the file
        try:
            # Use pickle to load the package manager
            logger.info(f"Loading package manager from {path}")
            with open(path, "rb") as f:
                obj = pickle.load(f)
                logger.info("Package manager loaded")
        except (OSError, pickle.UnpicklingError) as e:
            logger.error(f"Error loading package manager from {path}: {e}")
            return None

        if not isinstance(obj, PackageManager):
            return None

        # Rebuild the request handler for the scraper data sources (removed on save)
        for data_source in obj.data_sources:
            if isinstance(data_source, ScraperDataSource):
                data_source.request_handler = RequestHandler()
                # Set the logger for the scraper data source
                data_source.logger = MyLogger.get_logger("logger_datasource")

        obj.logger = logger

        return obj

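The save/load pair above is a plain pickle round trip with a type check on load. A minimal standalone sketch, using a hypothetical `Store` class in place of the package manager:

```python
import os
import pickle
import tempfile

class Store:
    """Hypothetical stand-in for a picklable manager object."""
    def __init__(self, packages):
        self.packages = packages

def save(obj, path):
    # Serialize the whole object with the highest pickle protocol
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load(path, expected_type):
    try:
        with open(path, "rb") as f:
            obj = pickle.load(f)
    except (OSError, pickle.UnpicklingError):
        return None
    # Reject files that contain an object of the wrong type
    return obj if isinstance(obj, expected_type) else None

path = os.path.join(tempfile.mkdtemp(), "store.olvpm")
save(Store({"pkg": "1.0"}), path)
restored = load(path, Store)
print(restored.packages)  # {'pkg': '1.0'}
```

As in `load_from_persistence`, a file that unpickles to the wrong type yields None rather than an exception.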
    @classmethod
    def load_from_csv(
        cls,
        csv_path: str,
        dependent_field: Optional[str] = None,
        dependency_field: Optional[str] = None,
        version_field: Optional[str] = None,
        dependency_version_field: Optional[str] = None,
        url_field: Optional[str] = None,
        default_format: bool = False,
    ) -> PackageManager:
        '''
        Load a csv file into a PackageManager object

        Parameters
        ----------
        csv_path : str
            Path of the csv file to load
        dependent_field : str, optional
            Name of the dependent field, by default None
        dependency_field : str, optional
            Name of the dependency field, by default None
        version_field : str, optional
            Name of the version field, by default None
        dependency_version_field : str, optional
            Name of the dependency version field, by default None
        url_field : str, optional
            Name of the url field, by default None
        default_format : bool, optional
            If True, the csv has the structure of full_adjlist.csv, by default False

        Examples
        --------
        >>> pm = PackageManager.load_from_csv(
            "full_adjlist.csv",
            dependent_field="dependent",
            dependency_field="dependency",
            version_field="version",
            dependency_version_field="dependency_version",
            url_field="url"
        )
        >>> pm = PackageManager.load_from_csv("full_adjlist.csv", default_format=True)
        '''

        # Init the logger for the package manager
        logger = MyLogger.get_logger('logger_packagemanager')

        try:
            logger.info(f"Loading csv file from {csv_path}")
            data = pd.read_csv(csv_path)
        except Exception as e:
            logger.error(f"Error loading csv file: {e}")
            raise PackageManagerLoadError(f"Error loading csv file: {e}") from e

        csv_fields = []

        if default_format:
            # If the csv has the structure of full_adjlist.csv, we use the default fields
            dependent_field = 'name'
            dependency_field = 'dependency'
            version_field = 'version'
            dependency_version_field = 'dependency_version'
            url_field = 'url'
            csv_fields = [dependent_field, dependency_field,
                          version_field, dependency_version_field, url_field]
        else:
            if dependent_field is None or dependency_field is None:
                raise PackageManagerLoadError(
                    "Dependent and dependency fields must be specified")

            csv_fields = [dependent_field, dependency_field]
            # If the optional fields are specified, we add them to the list
            if version_field is not None:
                csv_fields.append(version_field)
            if dependency_version_field is not None:
                csv_fields.append(dependency_version_field)
            if url_field is not None:
                csv_fields.append(url_field)

        # If the csv does not have the specified fields, we raise an error
        if any(col not in data.columns for col in csv_fields):
            logger.error("Invalid csv format")
            raise PackageManagerLoadError("Invalid csv format")

        # We create the data source
        data_source = CSVDataSource(
            file_path=csv_path,
            dependent_field=dependent_field,
            dependency_field=dependency_field,
            dependent_version_field=version_field,
            dependency_version_field=dependency_version_field,
            dependent_url_field=url_field
        )

        obj = cls([data_source])

        # Add the logger to the package manager
        obj.logger = logger

        # Return the package manager
        return obj

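The header validation step can be reproduced without pandas; a minimal sketch using only the standard csv module (the sample data and field names are illustrative):

```python
import csv
import io

def missing_columns(csv_text, required_fields):
    """Return the required fields that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [field for field in required_fields if field not in header]

sample = "name,dependency,version\npkgA,pkgB,1.0\n"
print(missing_columns(sample, ["name", "dependency"]))         # []
print(missing_columns(sample, ["name", "dependency", "url"]))  # ['url']
```

Reporting which fields are missing, rather than a bare "Invalid csv format", makes the failure easier to diagnose.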
    def initialize(
        self,
        package_names: Optional[List[str]] = None,
        show_progress: bool = False,
        chunk_size: int = 10000):
        '''
        Initializes the package manager by loading the packages from the data source

        Parameters
        ----------
        package_names : List[str]
            List of package names to load, if None, all the packages will be loaded
        show_progress : bool
            If True, a progress bar will be shown
        chunk_size : int
            Size of the chunks in which the package list is processed, to avoid memory errors

        .. warning::

            For large package lists, this method can take a long time to complete
        '''

        # Get package names from the data sources if needed
        if package_names is None:
            for data_source in self.data_sources:
                try:
                    package_names = data_source.obtain_package_names()
                    break
                except NotImplementedError as e:
                    self.logger.debug(f"Data source {data_source} does not implement obtain_package_names method: {e}")
                    continue
                except Exception as e:
                    self.logger.error(f"Error while obtaining package names from data source: {e}")
                    continue

        # Check if the package names are valid
        if package_names is None or not isinstance(package_names, list):
            raise ValueError("No valid package names found")

        # Instantiate the progress bar if needed
        progress_bar = tqdm.tqdm(
            total=len(package_names),
            colour="green",
            desc="Loading packages",
            unit="packages",
        ) if show_progress else None

        # Split the package names into chunks to avoid memory errors
        package_names_chunked = [package_names[i:i + chunk_size] for i in range(0, len(package_names), chunk_size)]

        for chunk in package_names_chunked:
            # Obtain the packages data from the data source and store them
            self.fetch_packages(
                package_names=chunk,
                progress_bar=progress_bar,
                extend=True
            )

        # Close the progress bar if needed
        if progress_bar is not None:
            progress_bar.close()

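The chunking step used by initialize can be sketched on its own:

```python
def chunked(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

names = [f"pkg{i}" for i in range(7)]
print(chunked(names, 3))
# [['pkg0', 'pkg1', 'pkg2'], ['pkg3', 'pkg4', 'pkg5'], ['pkg6']]
```

The last chunk is simply shorter when the list length is not a multiple of `chunk_size`; no padding is needed.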
    def fetch_package(self, package_name: str) -> Union[Package, None]:
        '''
        Builds a Package object using the data sources in order until one of them returns a valid package

        Parameters
        ----------
        package_name : str
            Name of the package

        Returns
        -------
        Union[Package, None]
            Package object if the package exists, None otherwise

        Examples
        --------
        >>> package = package_manager.fetch_package("package_name")
        >>> package
        <Package: package_name>
        '''
        # Obtain the package data from the data sources in order
        package_data = None
        for data_source in self.data_sources:

            if isinstance(data_source, (GithubScraper, CSVDataSource, ScraperDataSource, LibrariesioDataSource)):
                package_data = data_source.obtain_package_data(package_name)
            else:
                # Fall back to the in-memory package list
                package = self.get_package(package_name)
                package_data = package.to_dict() if package is not None else None

            if package_data is not None:
                self.logger.debug(f"Package {package_name} found using {data_source.__class__.__name__}")
                break
            else:
                self.logger.debug(f"Package {package_name} not found using {data_source.__class__.__name__}")

        # Return the package if it exists
        return None if package_data is None else Package.load(package_data)

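The ordered-fallback pattern used here (try each source until one yields data) can be sketched independently of the scraper classes, with plain dicts standing in for data sources:

```python
def fetch_from_sources(name, sources):
    """Query each source in order; return the first non-None result."""
    for source in sources:
        data = source.get(name)  # each source is a plain dict in this sketch
        if data is not None:
            return data
    return None

primary = {"numpy": {"version": "1.26"}}
fallback = {"numpy": {"version": "1.25"}, "scipy": {"version": "1.11"}}

print(fetch_from_sources("scipy", [primary, fallback]))    # {'version': '1.11'}
print(fetch_from_sources("numpy", [primary, fallback]))    # {'version': '1.26'}
print(fetch_from_sources("ggplot2", [primary, fallback]))  # None
```

Because the loop breaks on the first hit, the order of `data_sources` decides which source wins when several of them know the same package.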
    def fetch_packages(
        self,
        package_names: List[str],
        progress_bar: Optional[tqdm.tqdm] = None,
        extend: bool = False
    ) -> List[Package]:
        '''
        Builds a list of Package objects using the data sources in order until one of them returns a valid package

        Parameters
        ----------
        package_names : List[str]
            List of package names
        progress_bar : Optional[tqdm.tqdm]
            Progress bar to show the progress of the operation, by default None
        extend : bool
            If True, the packages will be added to the existing ones, otherwise, the existing ones will be replaced

        Returns
        -------
        List[Package]
            List of Package objects

        Examples
        --------
        >>> packages = package_manager.fetch_packages(["package_name_1", "package_name_2"])
        >>> packages
        [<Package: package_name_1>, <Package: package_name_2>]
        '''

        # Check if the package names are valid
        if not isinstance(package_names, list):
            raise ValueError("Package names must be a list")

        preferred_data_source = self.data_sources[0]

        # Return list
        packages = []

        # If the datasource is an instance of ScraperDataSource, use the obtain_packages_data method for parallelization
        if isinstance(preferred_data_source, ScraperDataSource):

            data_found, not_found = preferred_data_source.obtain_packages_data(
                package_names=package_names,
                progress_bar=progress_bar  # type: ignore
            )
            self.logger.info(f"Packages found: {len(data_found)}, Packages not found: {len(not_found)}")
            packages = [Package.load(package_data) for package_data in data_found]

        # If not, use the fetch_package method for sequential processing over the data sources of the list
        else:
            for package_name in package_names:
                package = self.fetch_package(package_name)
                if package is not None:
                    packages.append(package)

                if progress_bar is not None:
                    progress_bar.update(1)

        self.logger.info(f"Total packages found: {len(packages)}")

        # Update the self.packages attribute, overwriting packages with the same name
        # but keeping the other packages
        if extend:
            self.logger.info("Extending data source with obtained packages")
            for package in packages:
                self.packages[package.name] = package

        return packages

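The extend step merges the fetched packages into the existing dictionary, with same-name entries overwritten and everything else kept; sketched with plain dicts:

```python
def extend_packages(existing, fetched):
    """Merge fetched (name, data) pairs into existing, overwriting duplicates."""
    for name, data in fetched:
        existing[name] = data
    return existing

store = {"pkgA": "1.0", "pkgB": "1.0"}
extend_packages(store, [("pkgB", "2.0"), ("pkgC", "1.0")])
print(store)  # {'pkgA': '1.0', 'pkgB': '2.0', 'pkgC': '1.0'}
```

Keying the store by package name is what makes repeated initialize calls idempotent: refetching a package replaces its entry instead of duplicating it.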
    def get_package(self, package_name: str) -> Union[Package, None]:
        '''
        Obtain a package from the package manager

        Parameters
        ----------
        package_name : str
            Name of the package

        Returns
        -------
        Union[Package, None]
            Package object if the package exists, None otherwise

        Examples
        --------
        >>> package = package_manager.get_package("package_name")
        >>> print(package.name)
        '''
        return self.packages.get(package_name, None)

    def get_packages(self) -> List[Package]:
        '''
        Obtain the list of packages of the package manager

        Returns
        -------
        List[Package]
            List of packages of the package manager

        Examples
        --------
        >>> package_list = package_manager.get_packages()
        '''
        return list(self.packages.values())

    def package_names(self) -> List[str]:
        '''
        Obtain the list of package names of the package manager

        Returns
        -------
        List[str]
            List of package names of the package manager

        Examples
        --------
        >>> package_names = package_manager.package_names()
        '''
        return list(self.packages.keys())

    def fetch_package_names(self) -> List[str]:
        '''
        Obtain the list of package names from the preferred data source

        Returns
        -------
        List[str]
            List of package names of the data source

        Examples
        --------
        >>> package_names = package_manager.fetch_package_names()
        '''

        return self.data_sources[0].obtain_package_names()

    def export_dataframe(self, full_data=False) -> pd.DataFrame:
        '''
        Convert the object to an adjacency list, where each row represents a dependency
        Packages without dependencies contribute no rows

        Parameters
        ----------
        full_data : bool, optional
            If True, the adjacency list will contain the version and url of the packages, by default False

        Returns
        -------
        pd.DataFrame
            Dependency network as an adjacency list

        Examples
        --------
        >>> adj_list = package_manager.export_dataframe()
        >>> print(adj_list)
            [name, dependency]
        '''

        if not self.packages:
            self.logger.debug("The package manager is empty")
            return pd.DataFrame()

        rows = []

        if full_data:
            for package_name in self.packages.keys():
                package = self.get_package(package_name)

                for dependency in package.dependencies:

                    dependency_full = self.get_package(dependency.name)
                    if dependency_full is not None:
                        rows.append(
                            [package.name, package.version, package.url, dependency_full.name, dependency_full.version, dependency_full.url]
                        )
                    elif dependency.name is not None:
                        # The dependency is not in the package manager: keep the name, leave version and url empty
                        rows.append(
                            [package.name, package.version, package.url, dependency.name, None, None]
                        )

            return pd.DataFrame(rows, columns=['name', 'version', 'url', 'dependency', 'dependency_version', 'dependency_url'])
        else:
            for package_name in self.packages.keys():
                package = self.get_package(package_name)
                rows.extend(
                    [package.name, dependency.name]
                    for dependency in package.dependencies
                )
            return pd.DataFrame(rows, columns=['name', 'dependency'])

    def get_adjlist(self, package_name: str, adjlist: Optional[Dict] = None, deep_level: int = 5) -> Dict[str, List[str]]:
        """
        Generates the dependency network of a package from the in-memory data.

        Parameters
        ----------
        package_name : str
            The name of the package to generate the dependency network
        adjlist : Optional[Dict], optional
            The dependency network accumulated so far, by default None
        deep_level : int, optional
            The depth level of the dependency network, by default 5

        Returns
        -------
        Dict[str, List[str]]
            The dependency network of the package
        """

        # If the dependency network is not specified, we create it (Initial case)
        if adjlist is None:
            adjlist = {}

        # If the depth level is 0, we return the dependency network (Stop condition)
        if deep_level == 0:
            return adjlist

        # If the package is already in the dependency network, we return it (Stop condition)
        if package_name in adjlist:
            return adjlist

        # Use the data of the package manager
        current_package = self.get_package(package_name)
        dependencies = current_package.get_dependencies_names() if current_package is not None else []

        # Get the dependencies of the package and add them to the dependency network
        adjlist[package_name] = dependencies

        # Append the dependencies of the package to the dependency network
        for dependency_name in dependencies:

            if (dependency_name not in adjlist) and (self.get_package(dependency_name) is not None):

                adjlist = self.get_adjlist(
                    package_name=dependency_name,
                    adjlist=adjlist,
                    deep_level=deep_level - 1,
                )

        return adjlist

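The depth-limited traversal behind get_adjlist can be sketched over a plain dependency dict, with no package manager involved:

```python
def build_adjlist(name, deps, depth, adjlist=None):
    """Depth-limited DFS that collects a package's dependency adjacency list."""
    if adjlist is None:
        adjlist = {}
    # Stop when the depth budget is spent or the package was already visited
    if depth == 0 or name in adjlist:
        return adjlist
    dependencies = deps.get(name, [])
    adjlist[name] = dependencies
    for dep in dependencies:
        if dep in deps:  # only recurse into known packages
            build_adjlist(dep, deps, depth - 1, adjlist)
    return adjlist

deps = {"A": ["B", "C"], "B": ["C"], "C": []}
print(build_adjlist("A", deps, depth=5))  # {'A': ['B', 'C'], 'B': ['C'], 'C': []}
print(build_adjlist("A", deps, depth=1))  # {'A': ['B', 'C']}
```

The visited-set check (`name in adjlist`) is what keeps the recursion from looping on cyclic dependencies, while `depth` bounds how far the transitive closure is expanded.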
    def fetch_adjlist(self, package_name: str, deep_level: int = 5, adjlist: Optional[Dict] = None) -> Dict[str, List[str]]:
        """
        Generates the dependency network of a package from the data source.

        Parameters
        ----------
        package_name : str
            The name of the package to generate the dependency network
        deep_level : int, optional
            The depth level of the dependency network, by default 5
        adjlist : Optional[Dict], optional
            The dependency network accumulated so far, by default None

        Returns
        -------
        Dict[str, List[str]]
            The dependency network of the package
        """

        if adjlist is None:
            adjlist = {}

        # If the depth level is 0 or the package was already visited, we return the adjacency list (Stop condition)
        if deep_level == 0 or package_name in adjlist:
            return adjlist

        dependencies = []
        try:
            current_package = self.fetch_package(package_name)
            if current_package is not None:
                dependencies = current_package.get_dependencies_names()
        except Exception as e:
            self.logger.debug(f"Package {package_name} not found: {e}")

        # Add the package to the adjacency list
        adjlist[package_name] = dependencies

        # Append the dependencies of the package to the adjacency list if they are not already in it
        for dependency_name in dependencies:
            if dependency_name not in adjlist:
                try:
                    adjlist = self.fetch_adjlist(
                        package_name=dependency_name,    # The name of the dependency
                        deep_level=deep_level - 1,       # The depth level is reduced by 1
                        adjlist=adjlist                  # The global adjacency list
                    )
                except Exception:
                    self.logger.debug(
                        f"The package {dependency_name}, as a dependency of {package_name}, does not exist in the data source"
                    )

        return adjlist

    def __add_chunk(self,
        df, G,
        filter_field=None,
        filter_value=None
    ):
        # Optionally filter the chunk, then add its (name, dependency) pairs as edges
        filtered = df[df[filter_field] == filter_value] if filter_field else df
        links = list(zip(filtered["name"], filtered["dependency"]))
        G.add_edges_from(links)
        return G

    def get_network_graph(
            self, chunk_size=int(1e6),
            source_field="dependency", target_field="name",
            filter_field=None, filter_value=None) -> nx.DiGraph:
        """
        Builds a dependency network graph from a dataframe of dependencies.
        The dataframe must have two columns: dependent and dependency.

        Parameters
        ----------
        chunk_size : int
            Number of rows to process at a time
        source_field : str
            Name of the column containing the source node
        target_field : str
            Name of the column containing the target node
        filter_field : str, optional
            Name of the column to filter on, by default None
        filter_value : str, optional
            Value to filter on, by default None

        Returns
        -------
        nx.DiGraph
            Directed graph of dependencies
        """

        # If the default datasource is a CSVDataSource, build the graph directly from its dataframe
        default_datasource = self.__get_default_datasource()
        if isinstance(default_datasource, CSVDataSource):
            return nx.from_pandas_edgelist(
                default_datasource.data, source=source_field,
                target=target_field, create_using=nx.DiGraph()
            )

        # Otherwise, build the graph chunk by chunk from the exported dataframe
        df = self.export_dataframe()
        try:
            # New NetworkX directed Graph
            G = nx.DiGraph()

            for i in range(0, len(df), chunk_size):
                chunk = df.iloc[i:i + chunk_size]
                # Add dependencies from chunk to G
                G = self.__add_chunk(
                    chunk,
                    G,
                    filter_field=filter_field,
                    filter_value=filter_value
                )

            return G

        except Exception as e:
            self.logger.error(f"Error building the network graph: {e}")

    def get_transitive_network_graph(self, package_name: str, deep_level: int = 5, generate=False) -> nx.DiGraph:
        """
        Gets the transitive dependency network of a package as a NetworkX graph.

        Parameters
        ----------
        package_name : str
            The name of the package to get the dependency network
        deep_level : int, optional
            The depth level of the dependency network, by default 5
        generate : bool, optional
            If True, the dependency network is generated from the data source, by default False

        Returns
        -------
        nx.DiGraph
            The dependency network of the package
        """

        if generate:
            # Get the dependency network from the data source
            dependency_network = self.fetch_adjlist(package_name=package_name, deep_level=deep_level, adjlist={})
        else:
            # Get the dependency network from in-memory data
            dependency_network = self.get_adjlist(package_name=package_name, deep_level=deep_level)

        # Create a NetworkX graph of the dependency network as (DEPENDENCY ---> PACKAGE)
        G = nx.DiGraph()
        for name, dependencies in dependency_network.items():
            for dependency_name in dependencies:
                G.add_edge(dependency_name, name)

        return G

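The adjacency-list-to-graph conversion above orients every edge from dependency to dependent; a standalone sketch of that step, without networkx:

```python
def adjlist_to_edges(adjlist):
    """Turn {package: [dependencies]} into (dependency, package) edge pairs."""
    return [(dep, pkg) for pkg, deps in adjlist.items() for dep in deps]

network = {"A": ["B", "C"], "B": ["C"]}
print(adjlist_to_edges(network))  # [('B', 'A'), ('C', 'A'), ('C', 'B')]
```

With this orientation, an edge (B, A) reads "B is required by A", so reachability from a node gives the set of packages that transitively depend on it.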
    def __get_default_datasource(self):
        """
        Gets the default data source

        Returns
        -------
        DataSource
            The default data source
        """

        return self.data_sources[0] if len(self.data_sources) > 0 else None

Class that represents a package manager, which provides a way to obtain packages from a data source and store them in a dictionary

PackageManager( data_sources: Optional[List[olivia_finder.data_source.data_source.DataSource]] = None)
1561    def __init__(self, data_sources: Optional[List[DataSource]] = None):
1562        '''
1563        Constructor of the PackageManager class
1564
1565        Parameters
1566        ----------
1567
1568        data_sources : Optional[List[DataSource]]
1569            List of data sources to obtain the packages, if None, an empty list will be used
1570
1571        Raises
1572        ------
1573        ValueError
1574            If the data_sources parameter is None or empty
1575
1576        Examples
1577        --------
1578        >>> package_manager = PackageManager("My package manager", [CSVDataSource("csv_data_source", "path/to/file.csv")])
1579        '''
1580
1581        if not data_sources:
1582            raise ValueError("Data source cannot be empty")
1583
1584        self.data_sources: List[DataSource] = data_sources
1585        self.packages: Dict[str, Package] = {}
1586
1587        # Init the logger for the package manager
1588        self.logger = MyLogger.get_logger('logger_packagemanager')

Constructor of the PackageManager class

Parameters
  • data_sources (Optional[List[DataSource]]): List of data sources to obtain the packages, if None, an empty list will be used
Raises
  • ValueError: If the data_sources parameter is None or empty
Examples
>>> package_manager = PackageManager("My package manager", [CSVDataSource("csv_data_source", "path/to/file.csv")])
def save(self, path: str):
1591    def save(self, path: str):
1592        '''
1593        Saves the package manager to a file, normally it has the extension .olvpm for easy identification
1594        as an Olivia package manager file
1595
1596        Parameters
1597        ----------
1598        path : str
1599            Path of the file to save the package manager
1600        '''
1601
1602        # Remove redundant objects
1603        for data_source in self.data_sources:
1604            if isinstance(data_source, ScraperDataSource):
1605                try:
1606                    del data_source.request_handler
1607                except AttributeError:
1608                    pass
1609        
1610        try:
1611
1612            # Use pickle to save the package manager
1613            with open(path, "wb") as f:
1614                pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)
1615
1616        except Exception as e:
1617            raise PackageManagerSaveError(f"Error saving package manager: {e}") from e

Saves the package manager to a file, normally it has the extension .olvpm for easy identification as an Olivia package manager file

Parameters
  • path (str): Path of the file to save the package manager
@classmethod
def load_from_persistence(cls, path: str):
1619    @classmethod
1620    def load_from_persistence(cls, path: str):
1621        '''_fro
1622        Load the package manager from a file, the file must have been created with the save method
1623        Normally, it has the extension .olvpm
1624
1625        Parameters
1626        ----------
1627        path : str
1628            Path of the file to load the package manager
1629
1630        Returns
1631        -------
1632        Union[PackageManager, None] 
1633            PackageManager object if the file exists and is valid, None otherwise
1634        '''
1635
1636        # Init the logger for the package manager
1637        logger = MyLogger.get_logger("logger_packagemanager")
1638
1639        # Try to load the package manager from the file
1640        try:
1641            # Use pickle to load the package manager
1642            logger.info(f"Loading package manager from {path}")
1643            with open(path, "rb") as f:
1644                obj = pickle.load(f)
1645                logger.info("Package manager loaded")
1646        except PackageManagerLoadError:
1647            logger.error(f"Error loading package manager from {path}")
1648            return None
1649
1650        if not isinstance(obj, PackageManager):
1651            return None
1652        
1653        # Set the request handler for the scraper data sources
1654        for data_source in obj.data_sources:
1655            if isinstance(data_source, ScraperDataSource):
1656                data_source.request_handler = RequestHandler()
1657                # Set the logger for the scraper data source
1658                data_source.logger = MyLogger.get_logger("logger_datasource")
1659
1660        obj.logger = logger
1661
1662        return obj

_fro Load the package manager from a file, the file must have been created with the save method Normally, it has the extension .olvpm

Parameters
  • path (str): Path of the file to load the package manager
Returns
  • Union[PackageManager, None]: PackageManager object if the file exists and is valid, None otherwise
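The load-and-validate pattern used here can be sketched in isolation: unpickle the file, guard against I/O and unpickling errors, and check the object's type before trusting it. `DummyManager` and `load_manager` below are hypothetical stand-ins for `PackageManager` and `load_from_persistence`, not part of the library.

```python
import os
import pickle
import tempfile

class DummyManager:
    """Hypothetical stand-in for PackageManager."""
    def __init__(self, packages):
        self.packages = packages

def load_manager(path):
    """Return the unpickled object only if it is a DummyManager, else None."""
    try:
        with open(path, "rb") as f:
            obj = pickle.load(f)
    except (OSError, pickle.UnpicklingError):
        return None
    return obj if isinstance(obj, DummyManager) else None

# Round-trip through a temporary .olvpm-style file
path = os.path.join(tempfile.mkdtemp(), "test.olvpm")
with open(path, "wb") as f:
    pickle.dump(DummyManager({"pkg": 1}), f)

restored = load_manager(path)
```

The type check matters because pickle will happily deserialize any object; returning None on a mismatch matches the documented `Union[PackageManager, None]` contract.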
@classmethod
def load_from_csv( cls, csv_path: str, dependent_field: Optional[str] = None, dependency_field: Optional[str] = None, version_field: Optional[str] = None, dependency_version_field: Optional[str] = None, url_field: Optional[str] = None, default_format: Optional[bool] = False) -> olivia_finder.package_manager.PackageManager:
1664    @classmethod
1665    def load_from_csv(
1666        cls,
1667        csv_path: str,
1668        dependent_field: Optional[str] = None,
1669        dependency_field: Optional[str] = None, 
1670        version_field: Optional[str] = None,
1671        dependency_version_field: Optional[str] = None,
1672        url_field: Optional[str] = None,
1673        default_format: Optional[bool] = False,
1674    ) -> PackageManager:
1675        '''
1676        Load a csv file into a PackageManager object
1677
1678        Parameters
1679        ----------
1680        csv_path : str
1681            Path of the csv file to load
1682        dependent_field : str, optional
1683            Name of the dependent field, by default None
1684        dependency_field : str, optional
1685            Name of the dependency field, by default None
1686        version_field : str, optional
1687            Name of the version field, by default None
1688        dependency_version_field : str, optional
1689            Name of the dependency version field, by default None
1690        url_field : str, optional
1691            Name of the url field, by default None
1692        default_format : bool, optional
1693            If True, the csv has the structure of full_adjlist.csv, by default False
1694
1695        Examples
1696        --------
1697        >>> pm = PackageManager.load_from_csv(
1698            "full_adjlist.csv",
1699            dependent_field="dependent",
1700            dependency_field="dependency",
1701            version_field="version",
1702            dependency_version_field="dependency_version",
1703            url_field="url"
1704        )
1705        >>> pm = PackageManager.load_from_csv("full_adjlist.csv", default_format=True)
1706
1707        '''
1708
1709        # Init the logger for the package manager
1710        logger = MyLogger.get_logger('logger_packagemanager')
1711
1712        try:
1713            logger.info(f"Loading csv file from {csv_path}")
1714            data = pd.read_csv(csv_path)
1715        except Exception as e:
1716            logger.error(f"Error loading csv file: {e}")
1717            raise PackageManagerLoadError(f"Error loading csv file: {e}") from e
1718
1719        csv_fields = []
1720
1721        if default_format:
1722            # If the csv has the structure of full_adjlist.csv, we use the default fields
1723            dependent_field = 'name'
1724            dependency_field = 'dependency'
1725            version_field = 'version'
1726            dependency_version_field = 'dependency_version'
1727            url_field = 'url'
1728            csv_fields = [dependent_field, dependency_field,
1729                        version_field, dependency_version_field, url_field]
1730        else:
1731            if dependent_field is None or dependency_field is None:
1732                raise PackageManagerLoadError(
1733                    "Dependent and dependency fields must be specified")
1734
1735            csv_fields = [dependent_field, dependency_field]
1736            # If the optional fields are specified, we add them to the list
1737            if version_field is not None:
1738                csv_fields.append(version_field)
1739            if dependency_version_field is not None:
1740                csv_fields.append(dependency_version_field)
1741            if url_field is not None:
1742                csv_fields.append(url_field)
1743
1744        # If the csv does not have the specified fields, we raise an error
1745        if any(col not in data.columns for col in csv_fields):
1746            logger.error("Invalid csv format")
1747            raise PackageManagerLoadError("Invalid csv format")
1748
1749        # We create the data source
1750        data_source = CSVDataSource(
1751            file_path=csv_path,
1752            dependent_field=dependent_field,
1753            dependency_field=dependency_field,
1754            dependent_version_field=version_field,
1755            dependency_version_field=dependency_version_field,
1756            dependent_url_field=url_field
1757        )
1758
1759        obj = cls([data_source])
1760
1761        # Add the logger to the package manager
1762        obj.logger = logger
1763        
1764        # return the package manager
1765        return obj

Load a csv file into a PackageManager object

Parameters
  • csv_path (str): Path of the csv file to load
  • dependent_field (str, optional): Name of the dependent field, by default None
  • dependency_field (str, optional): Name of the dependency field, by default None
  • version_field (str, optional): Name of the version field, by default None
  • dependency_version_field (str, optional): Name of the dependency version field, by default None
  • url_field (str, optional): Name of the url field, by default None
  • default_format (bool, optional): If True, the csv has the structure of full_adjlist.csv, by default False
Examples
>>> pm = PackageManager.load_from_csv(
    "full_adjlist.csv",
    dependent_field="dependent",
    dependency_field="dependency",
    version_field="version",
    dependency_version_field="dependency_version",
    url_field="url"
)
>>> pm = PackageManager.load_from_csv("full_adjlist.csv", default_format=True)
def initialize( self, package_names: Optional[List[str]] = None, show_progress: Optional[bool] = False, chunk_size: Optional[int] = 10000):
1767    def initialize(
1768        self, 
1769        package_names: Optional[List[str]] = None, 
1770        show_progress: Optional[bool] = False, 
1771        chunk_size: Optional[int] = 10000):
1772        '''
1773        Initializes the package manager by loading the packages from the data source
1774
1775        Parameters
1776        ----------
1777        package_names : List[str]
1778            List of package names to load, if None, all the packages will be loaded
1779        show_progress : bool
1780            If True, a progress bar will be shown
1781        chunk_size : int
1782            Size of the chunks to load the packages, this is done to avoid memory errors
1783
1784        .. warning::
1785
1786            For large package lists, this method can take a long time to complete
1787
1788        '''
1789
1790        # Get package names from the data sources if needed
1791        if package_names is None:
1792            for data_source in self.data_sources:
1793                try:
1794                    package_names = data_source.obtain_package_names()
1795                    break
1796                except NotImplementedError as e:
1797                    self.logger.debug(f"Data source {data_source} does not implement obtain_package_names method: {e}")
1798                    continue
1799                except Exception as e:
1800                    self.logger.error(f"Error while obtaining package names from data source: {e}")
1801                    continue
1802
1803        # Check if the package names are valid
1804        if package_names is None or not isinstance(package_names, list):
1805            raise ValueError("No valid package names found")
1806
1807        # Instantiate the progress bar if needed
1808        progress_bar = tqdm.tqdm(
1809            total=len(package_names),
1810            colour="green",
1811            desc="Loading packages",
1812            unit="packages",
1813        ) if show_progress else None
1814
1815        # Create a chunked list of package names
1816        # This is done to avoid memory errors
1817        package_names_chunked = [package_names[i:i + chunk_size] for i in range(0, len(package_names), chunk_size)]
1818
1819        for chunk in package_names_chunked:
1820            # Obtain the packages data from the data source and store them
1821            self.fetch_packages(
1822                package_names=chunk,
1823                progress_bar=progress_bar,
1824                extend=True
1825            )
1826
1827        # Close the progress bar if needed
1828        if progress_bar is not None:
1829            progress_bar.close()

Initializes the package manager by loading the packages from the data source

Parameters
----------
package_names : List[str]
    List of package names to load, if None, all the packages will be loaded
show_progress : bool
    If True, a progress bar will be shown
chunk_size : int
    Size of the chunks to load the packages, this is done to avoid memory errors

For large package lists, this method can take a long time to complete
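The memory-friendly chunking used by `initialize` can be sketched on its own: the package-name list is sliced into fixed-size chunks, each of which is fetched and merged before the next is requested.

```python
def chunked(names, chunk_size):
    """Split a list of names into consecutive chunks of at most chunk_size."""
    return [names[i:i + chunk_size] for i in range(0, len(names), chunk_size)]

# 25 names with a chunk size of 10 -> chunks of 10, 10 and 5
chunks = chunked([f"pkg{i}" for i in range(25)], 10)
```

The last chunk is simply shorter when the total is not a multiple of `chunk_size`; no padding is needed.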

def fetch_package(self, package_name: str) -> Optional[olivia_finder.package.Package]:
1831    def fetch_package(self, package_name: str) -> Union[Package, None]:
1832        '''
1833        Builds a Package object using the data sources in order until one of them returns a valid package
1834
1835        Parameters
1836        ----------
1837        package_name : str
1838            Name of the package
1839
1840        Returns
1841        -------
1842        Union[Package, None]
1843            Package object if the package exists, None otherwise
1844
1845        Examples
1846        --------
1847        >>> package = package_manager.fetch_package("package_name")
1848        >>> package
1849        <Package: package_name>
1850        '''
1851        # Obtain the package data from the data sources in order
1852        package_data = None
1853        for data_source in self.data_sources:
1854            
1855            if isinstance(data_source, (GithubScraper, CSVDataSource, ScraperDataSource, LibrariesioDataSource)):
1856                package_data = data_source.obtain_package_data(package_name)
1857            else:
1858                package_data = package.to_dict() if (package := self.get_package(package_name)) is not None else None
1859
1860            if package_data is not None:
1861                self.logger.debug(f"Package {package_name} found using {data_source.__class__.__name__}")
1862                break
1863            else:
1864                self.logger.debug(f"Package {package_name} not found using {data_source.__class__.__name__}")
1865
1866
1867        # Return the package if it exists
1868        return None if package_data is None else Package.load(package_data)

Builds a Package object using the data sources in order until one of them returns a valid package

Parameters
  • package_name (str): Name of the package
Returns
  • Union[Package, None]: Package object if the package exists, None otherwise
Examples
>>> package = package_manager.fetch_package("package_name")
>>> package
<Package: package_name>
def fetch_packages( self, package_names: List[str], progress_bar: Optional[tqdm.std.tqdm], extend: bool = False) -> List[olivia_finder.package.Package]:
1870    def fetch_packages(
1871        self,
1872        package_names: List[str],
1873        progress_bar: Optional[tqdm.tqdm],
1874        extend: bool = False
1875    ) -> List[Package]:
1876        '''
1877        Builds a list of Package objects using the data sources in order until one of them returns a valid package
1878
1879        Parameters
1880        ----------
1881        package_names : List[str]
1882            List of package names
1883        progress_bar : tqdm.tqdm
1884            Progress bar to show the progress of the operation
1885        extend : bool
1886            If True, the packages will be added to the existing ones, otherwise, the existing ones will be replaced
1887
1888        Returns
1889        -------
1890        List[Package]
1891            List of Package objects
1892            
1893        Examples
1894        --------
1895        >>> packages = package_manager.fetch_packages(["package_name_1", "package_name_2"], progress_bar=None)
1896        >>> packages
1897        [<Package: package_name_1>, <Package: package_name_2>]
1898        '''
1899
1900        # Check if the package names are valid
1901        if not isinstance(package_names, list):
1902            raise ValueError("Package names must be a list")
1903
1904        preferred_data_source = self.data_sources[0]
1905
1906        # Return list
1907        packages = []
1908
1909        # if datasource is instance of ScraperDataSource use the obtain_packages_data method for parallelization
1910        if isinstance(preferred_data_source, ScraperDataSource):
1911            
1912            packages_data = []
1913            data_found, not_found = preferred_data_source.obtain_packages_data(
1914                package_names=package_names, 
1915                progress_bar=progress_bar # type: ignore
1916            )
1917            packages_data.extend(data_found)
1918            # pending_packages = not_found
1919            self.logger.info(f"Packages found: {len(data_found)}, Packages not found: {len(not_found)}")
1920            packages = [Package.load(package_data) for package_data in packages_data]
1921            
1922        # if not use the obtain_package_data method for sequential processing using the data_sources of the list
1923        else:
1924
1925            for package_name in package_names:
1926
1927                package = self.fetch_package(package_name)
1928                if package is not None:
1929                    packages.append(package)
1930
1931                if progress_bar is not None:
1932                    progress_bar.update(1)
1937        
1938        self.logger.info(f"Total packages found: {len(packages)}")
1939        
1940        # update the self.packages attribute overwriting the packages with the same name
1941        # but conserving the other packages
1942        if extend:
1943            self.logger.info("Extending data source with obtained packages")
1944            for package in packages:
1945                self.packages[package.name] = package
1946
1947        return packages

Builds a list of Package objects using the data sources in order until one of them returns a valid package

Parameters
  • package_names (List[str]): List of package names
  • progress_bar (tqdm.tqdm): Progress bar to show the progress of the operation
  • extend (bool): If True, the packages will be added to the existing ones, otherwise, the existing ones will be replaced
Returns
  • List[Package]: List of Package objects
Examples
>>> packages = package_manager.fetch_packages(["package_name_1", "package_name_2"], progress_bar=None)
>>> packages
[<Package: package_name_1>, <Package: package_name_2>]
def get_package(self, package_name: str) -> Optional[olivia_finder.package.Package]:
1949    def get_package(self, package_name: str) -> Union[Package, None]:
1950        '''
1951        Obtain a package from the package manager
1952
1953        Parameters
1954        ----------
1955        package_name : str
1956            Name of the package
1957
1958        Returns
1959        -------
1960        Union[Package, None]
1961            Package object if the package exists, None otherwise
1962
1963        Examples
1964        --------
1965        >>> package = package_manager.get_package("package_name")
1966        >>> print(package.name)
1967        '''
1968        return self.packages.get(package_name, None)

Obtain a package from the package manager

Parameters
  • package_name (str): Name of the package
Returns
  • Union[Package, None]: Package object if the package exists, None otherwise
Examples
>>> package = package_manager.get_package("package_name")
>>> print(package.name)
def get_packages(self) -> List[olivia_finder.package.Package]:
1970    def get_packages(self) -> List[Package]:
1971        '''
1972        Obtain the list of packages of the package manager
1973
1974        Returns
1975        -------
1976        List[Package]
1977            List of packages of the package manager
1978
1979        Examples
1980        --------
1981        >>> package_list = package_manager.get_packages()
1982        '''
1983        return list(self.packages.values())

Obtain the list of packages of the package manager

Returns
  • List[Package]: List of packages of the package manager
Examples
>>> package_list = package_manager.get_packages()
def package_names(self) -> List[str]:
1985    def package_names(self) -> List[str]:
1986        '''
1987        Obtain the list of package names of the package manager
1988
1989        Returns
1990        -------
1991        List[str]
1992            List of package names of the package manager
1993
1994        Examples
1995        --------
1996        >>> package_names = package_manager.package_names()
1997        '''
1998        return list(self.packages.keys())

Obtain the list of package names of the package manager

Returns
  • List[str]: List of package names of the package manager
Examples
>>> package_names = package_manager.package_names()
def fetch_package_names(self) -> List[str]:
2000    def fetch_package_names(self) -> List[str]:
2001        '''
2002        Obtain the list of package names of the package manager
2003
2004        Returns
2005        -------
2006        List[str]
2007            List of package names of the package manager
2008
2009        Examples
2010        --------
2011        >>> package_names = package_manager.fetch_package_names()
2012        '''
2013
2014        return self.data_sources[0].obtain_package_names()

Obtain the list of package names of the package manager

Returns
  • List[str]: List of package names of the package manager
Examples
>>> package_names = package_manager.fetch_package_names()
def export_dataframe(self, full_data=False) -> pandas.core.frame.DataFrame:
2016    def export_dataframe(self, full_data = False) -> pd.DataFrame:
2017        '''
2018        Convert the object to an adjacency list, where each row represents a dependency.
2019        If a package has no dependencies, it will appear in the list with the dependency field empty
2020
2021        Parameters
2022        ----------
2023        full_data : bool, optional
2024            If True, the adjacency list will contain the version and url of the packages, by default False
2025
2026        Returns
2027        -------
2028        pd.DataFrame
2029            Dependency network as an adjacency list
2030
2031        Examples    
2032        --------
2033        >>> adj_list = package_manager.export_dataframe()
2034        >>> print(adj_list)
2035            [name, dependency]
2036        '''
2037
2038        if not self.packages:
2039            self.logger.debug("The package manager is empty")
2040            return pd.DataFrame()
2041                    
2042
2043        rows = []
2044
2045        if full_data:
2046            for package_name in self.packages.keys():
2047                package = self.get_package(package_name)
2048
2049
2050                for dependency in package.dependencies:
2051                    
2052                    try:
2053                        dependency_full = self.get_package(dependency.name)
2054                        rows.append(
2055                            [package.name, package.version, package.url, dependency_full.name, dependency_full.version, dependency_full.url]
2056                        )
2057                    except Exception:
2058                        if dependency.name is not None:
2059                            rows.append(
2060                                [package.name, package.version, package.url, dependency.name, None, None]
2061                            )
2062
2063
2064            return pd.DataFrame(rows, columns=['name', 'version', 'url', 'dependency', 'dependency_version', 'dependency_url'])
2065        else:
2066            for package_name in self.packages.keys():
2067                package = self.get_package(package_name)
2068                rows.extend(
2069                    [package.name, dependency.name]
2070                    for dependency in package.dependencies
2071                )
2072            return pd.DataFrame(rows, columns=['name', 'dependency'])

Convert the object to an adjacency list, where each row represents a dependency. If a package has no dependencies, it will appear in the list with the dependency field empty

Parameters
  • full_data (bool, optional): If True, the adjacency list will contain the version and url of the packages, by default False
Returns
  • pd.DataFrame: Dependency network as an adjacency list
Examples
>>> adj_list = package_manager.export_dataframe()
>>> print(adj_list)
    [name, dependency]
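A minimal sketch of the default (`full_data=False`) branch: flatten a name-to-dependencies mapping into `[name, dependency]` rows. The `packages` dict below is a toy stand-in for the package manager's in-memory data.

```python
import pandas as pd

# toy mapping: package name -> list of dependency names
packages = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
}

# one row per (package, dependency) pair
rows = [
    [name, dep]
    for name, deps in packages.items()
    for dep in deps
]
adj_list = pd.DataFrame(rows, columns=["name", "dependency"])
```

This edge-list shape is what `get_network_graph` consumes when it builds the dependency graph.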
def get_adjlist( self, package_name: str, adjlist: Optional[Dict] = None, deep_level: int = 5) -> Dict[str, List[str]]:
2074    def get_adjlist(self, package_name: str, adjlist: Optional[Dict] = None, deep_level: int = 5) -> Dict[str, List[str]]:
2075        """
2076        Generates the dependency network of a package from the data source.
2077
2078        Parameters
2079        ----------
2080        package_name : str
2081            The name of the package to generate the dependency network
2082        adjlist : Optional[Dict], optional
2083            The dependency network of the package, by default None
2084        deep_level : int, optional
2085            The deep level of the dependency network, by default 5
2086
2087        Returns
2088        -------
2089        Dict[str, List[str]]
2090            The dependency network of the package
2091        """
2092
2093        # If the dependency network is not specified, we create it (Initial case)
2094        if adjlist is None:
2095            adjlist = {}
2096
2097        # If the deep level is 0, we return the dependency network (Stop condition)
2098        if deep_level == 0:
2099            return adjlist
2100
2101        # If the package is already in the dependency network, we return it (Stop condition)
2102        if package_name in adjlist:
2103            return adjlist
2104
2105        # Use the data of the package manager
2106        current_package = self.get_package(package_name)
2107        dependencies =  current_package.get_dependencies_names() if current_package is not None else []
2108
2109        # Get the dependencies of the package and add it to the dependency network if it is not already in it
2110        adjlist[package_name] = dependencies
2111
2112        # Append the dependencies of the package to the dependency network
2113        for dependency_name in dependencies:
2114
2115            if (dependency_name not in adjlist) and  (self.get_package(dependency_name) is not None):
2116
2117                adjlist = self.get_adjlist(
2118                    package_name = dependency_name, 
2119                    adjlist = adjlist, 
2120                    deep_level = deep_level - 1,
2121                )
2122
2123        return adjlist

Generates the dependency network of a package from the data source.

Parameters
  • package_name (str): The name of the package to generate the dependency network
  • adjlist (Optional[Dict], optional): The dependency network of the package, by default None
  • deep_level (int, optional): The deep level of the dependency network, by default 5
Returns
  • Dict[str, List[str]]: The dependency network of the package
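The depth-limited recursion above can be sketched over a plain dict; `PACKAGES` below is a hypothetical stand-in for the package manager's in-memory data, not part of the library.

```python
# toy in-memory data: package name -> list of dependency names
PACKAGES = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
    "D": [],
}

def get_adjlist(name, adjlist=None, deep_level=5):
    """Depth-limited recursive traversal, mirroring get_adjlist()."""
    if adjlist is None:
        adjlist = {}
    # Stop when the depth budget is spent or the package was already visited
    if deep_level == 0 or name in adjlist:
        return adjlist
    deps = PACKAGES.get(name, [])
    adjlist[name] = deps
    for dep in deps:
        # Recurse only into known, not-yet-visited dependencies
        if dep not in adjlist and dep in PACKAGES:
            get_adjlist(dep, adjlist, deep_level - 1)
    return adjlist

network = get_adjlist("A")
```

With `deep_level=1` only the root package is expanded, which is how the depth limit bounds the size of the network.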
def fetch_adjlist( self, package_name: str, deep_level: int = 5, adjlist: dict = None) -> Dict[str, List[str]]:
2125    def fetch_adjlist(self, package_name: str, deep_level: int = 5, adjlist: dict = None) -> Dict[str, List[str]]:
2126        """
2127        Generates the dependency network of a package from the data source.
2128
2129        Parameters
2130        ----------
2131        package_name : str
2132            The name of the package to generate the dependency network
2133        deep_level : int, optional
2134            The deep level of the dependency network, by default 5
2135        adjlist : dict, optional
2136            The adjacency list accumulated so far, by default None
2137
2138        Returns
2139        -------
2140        Dict[str, List[str]]
2141            The dependency network of the package
2142        """
2143
2144        if adjlist is None:
2145            adjlist = {}
2146
2147        # If the deep level is 0, we return the adjacency list (Stop condition) 
2148        if deep_level == 0 or package_name in adjlist:
2149            return adjlist
2150
2151        dependencies = []
2152        try:
2153            current_package = self.fetch_package(package_name)
2154            dependencies = current_package.get_dependencies_names()
2155
2156        except Exception as e:
2157            self.logger.debug(f"Package {package_name} not found: {e}")
2158
2159        # Add the package to the adjacency list if it is not already in it
2160        adjlist[package_name] = dependencies
2161
2162        # Append the dependencies of the package to the adjacency list if they are not already in it
2163        for dependency_name in dependencies:
2164            if dependency_name not in adjlist:
2165                try:     
2166                    adjlist = self.fetch_adjlist(
2167                        package_name=dependency_name,                # The name of the dependency
2168                        deep_level=deep_level - 1,                   # The deep level is reduced by 1
2169                        adjlist=adjlist                              # The global adjacency list
2170                    )
2171                except Exception:
2172                    self.logger.debug(
2173                        f"The package {dependency_name}, as a dependency of {package_name}, does not exist in the data source"
2174                    )
2175
2176        return adjlist

Generates the dependency network of a package from the data source.

Parameters
  • package_name (str): The name of the package to generate the dependency network
  • deep_level (int, optional): The deep level of the dependency network, by default 5
  • adjlist (dict, optional): The adjacency list accumulated so far, by default None
Returns
  • Dict[str, List[str]]: The dependency network of the package
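The difference from the in-memory variant is that a failed lookup does not abort the traversal: an unresolvable package still gets an (empty) entry in the adjacency list. A sketch, with `REMOTE` as a hypothetical stand-in for a data source whose lookups can fail:

```python
# toy remote-like source; "X" is referenced but not resolvable
REMOTE = {"A": ["B", "X"], "B": []}

def fetch_adjlist(name, deep_level=5, adjlist=None):
    """Depth-limited traversal tolerating missing packages, mirroring fetch_adjlist()."""
    if adjlist is None:
        adjlist = {}
    if deep_level == 0 or name in adjlist:
        return adjlist
    try:
        deps = REMOTE[name]          # stands in for fetch_package()
    except KeyError:
        deps = []                    # package not found: record an empty entry
    adjlist[name] = deps
    for dep in deps:
        if dep not in adjlist:
            fetch_adjlist(dep, deep_level - 1, adjlist)
    return adjlist

network = fetch_adjlist("A")
```

Keeping the empty entry means the resulting graph still contains the node for the unresolved dependency.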
def get_network_graph( self, chunk_size=1000000, source_field='dependency', target_field='name', filter_field=None, filter_value=None) -> networkx.classes.digraph.DiGraph:
2189    def get_network_graph(
2190            self, chunk_size = int(1e6), 
2191            source_field = "dependency", target_field = "name",
2192            filter_field=None, filter_value=None) -> nx.DiGraph:
2193        """
2194        Builds a dependency network graph from a dataframe of dependencies.
2195        The dataframe must have two columns: dependent and dependency.
2196
2197        Parameters
2198        ----------
2199        chunk_size : int
2200            Number of rows to process at a time
2201        source_field : str
2202            Name of the column containing the source node
2203        target_field : str
2204            Name of the column containing the target node
2205        filter_field : str, optional
2206            Name of the column to filter on, by default None
2207        filter_value : str, optional
2208            Value to filter on, by default None
2209
2210        Returns
2211        -------
2212        nx.DiGraph
2213            Directed graph of dependencies
2214        """
2215
2216
2217        # If the default datasource is a CSVDataSource, we use a custom implementation
2218        default_datasource = self.__get_default_datasource()
2219        if isinstance(default_datasource, CSVDataSource):
2220            return nx.from_pandas_edgelist(
2221                default_datasource.data, source=source_field, 
2222                target=target_field, create_using=nx.DiGraph()
2223            )
2224
2225        # If the default datasource is not a CSVDataSource, we use the default implementation
2226        df = self.export_dataframe()
2227        try:
2228            # New NetworkX directed Graph
2229            G = nx.DiGraph()
2230            
2231            for i in range(0, len(df), chunk_size):
2232                chunk = df.iloc[i:i+chunk_size]
2233                # Add dependencies from chunk to G
2234                G = self.__add_chunk(
2235                    chunk,
2236                    G,
2237                    filter_field=filter_field,
2238                    filter_value=filter_value
2239                )
2240        
2241            return G
2242        
2243        except Exception as e:
2244            self.logger.error(f"Error building the network graph: {e}")
2245            raise

Builds a dependency network graph from a dataframe of dependencies. The dataframe must have two columns: dependent and dependency.

Parameters
  • chunk_size (int): Number of rows to process at a time
  • source_field (str): Name of the column containing the source node
  • target_field (str): Name of the column containing the target node
  • filter_field (str, optional): Name of the column to filter on, by default None
  • filter_value (str, optional): Value to filter on, by default None
Returns
  • nx.DiGraph: Directed graph of dependencies
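The chunked graph construction can be sketched with a toy edge-list DataFrame in place of the `export_dataframe()` output; the tiny `chunk_size` is only for illustration. Edges point from `dependency` to `name`, matching the default `source_field`/`target_field`.

```python
import pandas as pd
import networkx as nx

# toy adjacency list, as export_dataframe() would produce
df = pd.DataFrame(
    [["A", "B"], ["A", "C"], ["B", "C"]],
    columns=["name", "dependency"],
)

# build the directed graph chunk by chunk to bound memory use
chunk_size = 2
G = nx.DiGraph()
for i in range(0, len(df), chunk_size):
    chunk = df.iloc[i:i + chunk_size]
    # add one edge per row: dependency -> dependent package
    G.add_edges_from(zip(chunk["dependency"], chunk["name"]))
```

Processing the frame in slices keeps peak memory proportional to the chunk, which is the point of the `chunk_size` parameter.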
def get_transitive_network_graph( self, package_name: str, deep_level: int = 5, generate=False) -> networkx.classes.digraph.DiGraph:
2246    def get_transitive_network_graph(self, package_name: str, deep_level: int = 5, generate = False) -> nx.DiGraph:
2247        """
2248        Gets the transitive dependency network of a package as a NetworkX graph.
2249
2250        Parameters
2251        ----------
2252        package_name : str
2253            The name of the package to get the dependency network
2254        deep_level : int, optional
2255            The deep level of the dependency network, by default 5
2256        generate : bool, optional
2257            If True, the dependency network is generated from the data source, by default False
2258
2259        Returns
2260        -------
2261        nx.DiGraph
2262            The dependency network of the package
2263        """
2264
2265        if generate:
2266            # Get the dependency network from the data source
2267            dependency_network = self.fetch_adjlist(package_name=package_name, deep_level=deep_level, adjlist={})
2268
2269        else:
2270            # Get the dependency network from in-memory data
2271            dependency_network = self.get_adjlist(package_name=package_name, deep_level=deep_level)
2272
2273        # Create a NetworkX graph of the dependency network as (DEPENDENCY ---> PACKAGE)
2274        G = nx.DiGraph()
2275        for package_name, dependencies in dependency_network.items():
2276            for dependency_name in dependencies:
2277                G.add_edge(dependency_name, package_name)
2278
2279        return G

Gets the transitive dependency network of a package as a NetworkX graph.

Parameters
  • package_name (str): The name of the package to get the dependency network
  • deep_level (int, optional): The deep level of the dependency network, by default 5
  • generate (bool, optional): If True, the dependency network is generated from the data source, by default False
Returns
  • nx.DiGraph: The dependency network of the package
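The adjacency-list-to-graph conversion at the end of the method can be sketched with a toy network; edges point DEPENDENCY ---> PACKAGE, as in the source.

```python
import networkx as nx

# toy dependency network, as returned by get_adjlist()/fetch_adjlist()
adjlist = {"A": ["B", "C"], "B": ["C"], "C": []}

# one edge per dependency, oriented dependency -> package
G = nx.DiGraph()
for package, deps in adjlist.items():
    for dep in deps:
        G.add_edge(dep, package)
```

With this orientation, a package's in-degree counts its direct dependencies, which is the convention Olivia's network analysis expects.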
class PackageManagerLoadError(builtins.Exception):
2293class PackageManagerLoadError(Exception):
2294    """
2295    Exception raised when an error occurs while loading a package manager
2296
2297    Attributes
2298    ----------
2299    message : str
2300        Error message
2301    """
2302
2303    def __init__(self, message):
2304        self.message = message
2305        super().__init__(self.message)

Exception raised when an error occurs while loading a package manager

Attributes
  • message (str): Error message
PackageManagerLoadError(message)
2303    def __init__(self, message):
2304        self.message = message
2305        super().__init__(self.message)
Inherited Members
builtins.BaseException
with_traceback
class PackageManagerSaveError(builtins.Exception):
2307class PackageManagerSaveError(Exception):
2308    """
2309    Exception raised when an error occurs while saving a package manager
2310
2311    Attributes
2312    ----------
2313    message : str
2314        Error message
2315    """
2316
2317    def __init__(self, message):
2318        self.message = message
2319        super().__init__(self.message)

Exception raised when an error occurs while saving a package manager

Attributes
  • message (str): Error message
PackageManagerSaveError(message)
2317    def __init__(self, message):
2318        self.message = message
2319        super().__init__(self.message)
Inherited Members
builtins.BaseException
with_traceback