olivia_finder.package_manager
Initialize a package manager
Note:
A scraper-type datasource must fetch its data before use, so the package manager requires an explicit initialization step.
A CSV-type datasource already contains all the data, which can be retrieved directly.
Loading from a persistence file implies that the file contains an object that was already initialized or already holds data.
A Bioconductor scraping-based package manager
from olivia_finder.package_manager import PackageManager
bioconductor_pm_scraper = PackageManager(
data_sources=[ # List of data sources
BioconductorScraper(),
]
)
A CRAN package manager loaded from a CSV file
cran_pm_csv = PackageManager(
data_sources=[ # List of data sources
CSVDataSource(
# Path to the CSV file
"aux_data/cran_adjlist_test.csv",
dependent_field="Project Name",
dependency_field="Dependency Name",
dependent_version_field="Version Number",
)
]
)
# The package manager must be initialized to fill the package list with the CSV data
cran_pm_csv.initialize(show_progress=True)
Loading packages: 100%|██████████| 275/275 [00:00<00:00, 729.91packages/s]
A Bioconductor package manager loaded from a persistence file
bioconductor_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/bioconductor_scraper.olvpm")
A Maven package manager backed by the Libraries.io API
maven_pm_libio = PackageManager(
data_sources=[ # List of data sources
LibrariesioDataSource(platform="maven")
]
)
For scraping-based datasources: initialize the structure with the data from the selected sources
Note:
Automatically obtaining the Bioconductor package list, as mentioned above, depends on Selenium, which requires a browser pre-installed on the system (Firefox, in our case).
If you are running this notebook on a third-party Jupyter server, a browser may not be available.
As a workaround, the package_names parameter lets us supply the list of packages ourselves so the process can continue.
# bioconductor_pm_scraper.initialize(show_progress=True)
Note: if we do not provide a list of packages, it will be obtained automatically, provided the datasource implements that functionality
Initialization of the Bioconductor package manager using a package list
# Load the package list from a file
bioconductor_package_list = []
with open('../results/package_lists/bioconductor_scraped.txt', 'r') as file:
bioconductor_package_list = file.read().splitlines()
# Initialize the package manager with the package list
bioconductor_pm_scraper.initialize(show_progress=True, package_names=bioconductor_package_list[:10])
Loading packages: 100%|██████████| 10/10 [00:06<00:00, 1.43packages/s]
Initialization of the PyPI package manager
pypi_pm_scraper = PackageManager(
data_sources=[ # List of data sources
PypiScraper(),
]
)
pypi_package_list = []
with open('../results/package_lists/pypi_scraped.txt', 'r') as file:
pypi_package_list = file.read().splitlines()
# Initialize the package manager
pypi_pm_scraper.initialize(show_progress=True, package_names=pypi_package_list[:10])
# Save the package manager
pypi_pm_scraper.save(path="aux_data/pypi_pm_scraper_test.olvpm")
Loading packages: 100%|██████████| 10/10 [00:01<00:00, 6.75packages/s]
Initialization of the npm package manager
# Initialize the package manager
npm_package_list = []
with open('../results/package_lists/npm_scraped.txt', 'r') as file:
npm_package_list = file.read().splitlines()
npm_pm_scraper = PackageManager(
data_sources=[ # List of data sources
NpmScraper(),
]
)
# Initialize the package manager
npm_pm_scraper.initialize(show_progress=True, package_names=npm_package_list[:10])
# Save the package manager
npm_pm_scraper.save(path="aux_data/npm_pm_scraper_test.olvpm")
Loading packages: 100%|██████████| 10/10 [00:02<00:00, 3.88packages/s]
And using a CSV-based package manager
cran_pm_csv.initialize(show_progress=True)
Loading packages: 100%|██████████| 275/275 [00:00<00:00, 675.72packages/s]
Persistence
Save the package manager
pypi_pm_scraper.save("aux_data/pypi_scraper_pm_saved.olvpm")
Load package manager from persistence file
from olivia_finder.package_manager import PackageManager
bioconductor_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/bioconductor_scraper.olvpm")
cran_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/cran_scraper.olvpm")
pypi_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/pypi_scraper.olvpm")
npm_pm_loaded = PackageManager.load_from_persistence("../results/package_managers/npm_scraper.olvpm")
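The .olvpm persistence format is internal to olivia_finder; purely as an illustration of the save/load round trip it performs, here is a sketch using the standard pickle module (the real serialization format may differ, and the state dict is a stand-in):

```python
import os
import pickle
import tempfile

def save_object(obj, path):
    # Serialize a picklable object to disk
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_object(path):
    # Restore the object exactly as it was saved
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip a stand-in for a package manager's internal state
state = {"packages": ["A3", "xtable"], "initialized": True}
path = os.path.join(tempfile.gettempdir(), "pm_state_example.bin")
save_object(state, path)
restored = load_object(path)
print(restored == state)  # True
```

The benefit of this pattern is the same one the notebook relies on: a manager that took hours to initialize by scraping can be restored in seconds.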
Package manager functionalities
List package names
bioconductor_pm_loaded.package_names()[300:320]
['CNVgears',
'CONSTANd',
'CTSV',
'CellNOptR',
'ChAMP',
'ChIPseqR',
'CiteFuse',
'Clonality',
'CopyNumberPlots',
'CytoGLMM',
'DEFormats',
'DEScan2',
'DEsingle',
'DMRcaller',
'DOSE',
'DSS',
'DelayedMatrixStats',
'DirichletMultinomial',
'EBImage',
'EDASeq']
pypi_pm_loaded.package_names()[300:320]
['adafruit-circuitpython-bh1750',
'adafruit-circuitpython-ble-beacon',
'adafruit-circuitpython-ble-eddystone',
'adafruit-circuitpython-bluefruitspi',
'adafruit-circuitpython-bno08x',
'adafruit-circuitpython-circuitplayground',
'adafruit-circuitpython-debug-i2c',
'adafruit-circuitpython-displayio-ssd1306',
'adafruit-circuitpython-ds18x20',
'adafruit-circuitpython-ens160',
'adafruit-circuitpython-fingerprint',
'adafruit-circuitpython-gc-iot-core',
'adafruit-circuitpython-hcsr04',
'adafruit-circuitpython-htu31d',
'adafruit-circuitpython-imageload',
'adafruit-circuitpython-itertools',
'adafruit-circuitpython-lis2mdl',
'adafruit-circuitpython-lps2x',
'adafruit-circuitpython-lsm9ds0',
'adafruit-circuitpython-max31855']
Obtaining package names from the Libraries.io API is not supported
maven_pm_libio.package_names()
[]
Get the data as a dict using the datasource
maven_pm_libio.fetch_package("org.apache.commons:commons-lang3").to_dict()
{'name': 'org.apache.commons:commons-lang3',
'version': '3.9',
'url': 'https://repo1.maven.org/maven2/org/apache/commons/commons-lang3',
'dependencies': [{'name': 'org.openjdk.jmh:jmh-generator-annprocess',
'version': '1.25.2',
'url': None,
'dependencies': []},
{'name': 'org.openjdk.jmh:jmh-core',
'version': '1.25.2',
'url': None,
'dependencies': []},
{'name': 'org.easymock:easymock',
'version': '5.1.0',
'url': None,
'dependencies': []},
{'name': 'org.hamcrest:hamcrest',
'version': None,
'url': None,
'dependencies': []},
{'name': 'org.junit-pioneer:junit-pioneer',
'version': '2.0.1',
'url': None,
'dependencies': []},
{'name': 'org.junit.jupiter:junit-jupiter',
'version': '5.9.3',
'url': None,
'dependencies': []}]}
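The nested dict returned by to_dict() can be flattened into (package, dependency) pairs, which is essentially the shape export_dataframe produces later in this notebook. A minimal sketch, using only the dict structure shown above (the sample data is a trimmed copy of that output):

```python
def to_edges(pkg):
    """Yield (name, dependency_name) pairs from a package dict, recursively."""
    for dep in pkg.get("dependencies", []):
        yield (pkg["name"], dep["name"])
        yield from to_edges(dep)  # descend into nested dependencies

pkg = {
    "name": "org.apache.commons:commons-lang3",
    "dependencies": [
        {"name": "org.easymock:easymock", "dependencies": []},
        {"name": "org.hamcrest:hamcrest", "dependencies": []},
    ],
}
edges = list(to_edges(pkg))
# [('org.apache.commons:commons-lang3', 'org.easymock:easymock'),
#  ('org.apache.commons:commons-lang3', 'org.hamcrest:hamcrest')]
```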
cran_pm_csv.get_package('nmfem').to_dict()
{'name': 'nmfem',
'version': '1.0.4',
'url': None,
'dependencies': [{'name': 'rmarkdown',
'version': None,
'url': None,
'dependencies': []},
{'name': 'testthat', 'version': None, 'url': None, 'dependencies': []},
{'name': 'knitr', 'version': None, 'url': None, 'dependencies': []},
{'name': 'tidyr', 'version': None, 'url': None, 'dependencies': []},
{'name': 'mixtools', 'version': None, 'url': None, 'dependencies': []},
{'name': 'd3heatmap', 'version': None, 'url': None, 'dependencies': []},
{'name': 'dplyr', 'version': None, 'url': None, 'dependencies': []},
{'name': 'plyr', 'version': None, 'url': None, 'dependencies': []},
{'name': 'R', 'version': None, 'url': None, 'dependencies': []}]}
Get a package from the manager's own data
cran_pm_loaded.get_package('A3')
<olivia_finder.package.Package at 0x7f3c3722fe20>
npm_pm_loaded.get_package("react").to_dict()
{'name': 'react',
'version': '18.2.0',
'url': 'https://www.npmjs.com/package/react',
'dependencies': [{'name': 'loose-envify',
'version': '^1.1.0',
'url': None,
'dependencies': []}]}
List package objects
len(npm_pm_loaded.package_names())
1919072
pypi_pm_loaded.get_packages()[300:320]
[<olivia_finder.package.Package at 0x7f3c58ea7ac0>,
<olivia_finder.package.Package at 0x7f3c58ea7be0>,
<olivia_finder.package.Package at 0x7f3c58ea7d00>,
<olivia_finder.package.Package at 0x7f3c58ea7e20>,
<olivia_finder.package.Package at 0x7f3c58ea7f40>,
<olivia_finder.package.Package at 0x7f3c590e80a0>,
<olivia_finder.package.Package at 0x7f3c590e81c0>,
<olivia_finder.package.Package at 0x7f3c590e82e0>,
<olivia_finder.package.Package at 0x7f3c590e83a0>,
<olivia_finder.package.Package at 0x7f3c590e8520>,
<olivia_finder.package.Package at 0x7f3c590e86a0>,
<olivia_finder.package.Package at 0x7f3c590e87c0>,
<olivia_finder.package.Package at 0x7f3c590e88e0>,
<olivia_finder.package.Package at 0x7f3c590e89a0>,
<olivia_finder.package.Package at 0x7f3c590e8b20>,
<olivia_finder.package.Package at 0x7f3c590e8c40>,
<olivia_finder.package.Package at 0x7f3c590e8d00>,
<olivia_finder.package.Package at 0x7f3c590e8e80>,
<olivia_finder.package.Package at 0x7f3c590e8fa0>,
<olivia_finder.package.Package at 0x7f3c590e90c0>]
Obtain dependency networks
Using the previously obtained data, which is already loaded in the structure
a4_network = bioconductor_pm_loaded.fetch_adjlist("a4")
a4_network
{'a4': ['a4Base', 'a4Preproc', 'a4Classif', 'a4Core', 'a4Reporting'],
'a4Base': ['a4Preproc',
'a4Core',
'methods',
'graphics',
'grid',
'Biobase',
'annaffy',
'mpm',
'genefilter',
'limma',
'multtest',
'glmnet',
'gplots'],
'a4Preproc': ['BiocGenerics', 'Biobase'],
'BiocGenerics': ['R', 'methods', 'utils', 'graphics', 'stats'],
'R': [],
'methods': [],
'utils': [],
'graphics': [],
'stats': [],
'Biobase': ['R', 'BiocGenerics', 'utils', 'methods'],
'a4Core': ['Biobase', 'glmnet', 'methods', 'stats'],
'glmnet': [],
'grid': [],
'annaffy': ['R',
'methods',
'Biobase',
'BiocManager',
'GO.db',
'AnnotationDbi',
'DBI'],
'BiocManager': [],
'GO.db': [],
'AnnotationDbi': ['R',
'methods',
'stats4',
'BiocGenerics',
'Biobase',
'IRanges',
'DBI',
'RSQLite',
'S4Vectors',
'stats',
'KEGGREST'],
'stats4': [],
'IRanges': ['R',
'methods',
'utils',
'stats',
'BiocGenerics',
'S4Vectors',
'stats4'],
'DBI': [],
'RSQLite': [],
'S4Vectors': ['R', 'methods', 'utils', 'stats', 'stats4', 'BiocGenerics'],
'KEGGREST': ['R', 'methods', 'httr', 'png', 'Biostrings'],
'mpm': [],
'genefilter': ['MatrixGenerics',
'AnnotationDbi',
'annotate',
'Biobase',
'graphics',
'methods',
'stats',
'survival',
'grDevices'],
'MatrixGenerics': ['matrixStats', 'methods'],
'matrixStats': [],
'annotate': ['R',
'AnnotationDbi',
'XML',
'Biobase',
'DBI',
'xtable',
'graphics',
'utils',
'stats',
'methods',
'BiocGenerics',
'httr'],
'XML': [],
'xtable': [],
'httr': [],
'survival': [],
'grDevices': [],
'limma': ['R', 'grDevices', 'graphics', 'stats', 'utils', 'methods'],
'multtest': ['R',
'methods',
'BiocGenerics',
'Biobase',
'survival',
'MASS',
'stats4'],
'MASS': [],
'gplots': [],
'a4Classif': ['a4Core',
'a4Preproc',
'methods',
'Biobase',
'ROCR',
'pamr',
'glmnet',
'varSelRF',
'utils',
'graphics',
'stats'],
'ROCR': [],
'pamr': [],
'varSelRF': [],
'a4Reporting': ['methods', 'xtable']}
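An adjacency dict like the one above maps each package to its direct dependencies, so the full transitive dependency set of any package can be recovered with a breadth-first traversal. A minimal sketch over a small hand-picked subset of the data above:

```python
from collections import deque

def transitive_deps(adjlist, root):
    """Return every package reachable from root via dependency edges."""
    seen, queue = set(), deque([root])
    while queue:
        pkg = queue.popleft()
        for dep in adjlist.get(pkg, []):
            if dep not in seen:  # each package is visited once
                seen.add(dep)
                queue.append(dep)
    return seen

adjlist = {
    "a4": ["a4Preproc", "a4Core"],
    "a4Preproc": ["BiocGenerics", "Biobase"],
    "a4Core": ["Biobase", "glmnet"],
    "BiocGenerics": [], "Biobase": [], "glmnet": [],
}
print(sorted(transitive_deps(adjlist, "a4")))
# ['Biobase', 'BiocGenerics', 'a4Core', 'a4Preproc', 'glmnet']
```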
Get the transitive dependency network graph
commons_lang3_network = maven_pm_libio.get_transitive_network_graph("org.apache.commons:commons-lang3", generate=True)
commons_lang3_network
<networkx.classes.digraph.DiGraph at 0x7f3c67ee3d90>
# Draw the network
import networkx as nx
import matplotlib.pyplot as plt
from matplotlib import patches

pos = nx.spring_layout(commons_lang3_network)
nx.draw(commons_lang3_network, pos, node_size=50, font_size=8)
nx.draw_networkx_nodes(commons_lang3_network, pos, nodelist=["org.apache.commons:commons-lang3"], node_size=100, node_color="r")
plt.title("org.apache.commons:commons-lang3 transitive network", fontsize=15)
# add legend for red node
red_patch = patches.Patch(color='red', label='org.apache.commons:commons-lang3')
plt.legend(handles=[red_patch])
plt.show()

Obtaining updated data
a4_network2 = bioconductor_pm_loaded.get_adjlist("a4")
a4_network2
{'a4': ['a4Base', 'a4Preproc', 'a4Classif', 'a4Core', 'a4Reporting'],
'a4Base': ['a4Preproc',
'a4Core',
'methods',
'graphics',
'grid',
'Biobase',
'annaffy',
'mpm',
'genefilter',
'limma',
'multtest',
'glmnet',
'gplots'],
'a4Preproc': ['BiocGenerics', 'Biobase'],
'BiocGenerics': ['R', 'methods', 'utils', 'graphics', 'stats'],
'Biobase': ['R', 'BiocGenerics', 'utils', 'methods'],
'a4Core': ['Biobase', 'glmnet', 'methods', 'stats'],
'annaffy': ['R',
'methods',
'Biobase',
'BiocManager',
'GO.db',
'AnnotationDbi',
'DBI'],
'AnnotationDbi': ['R',
'methods',
'utils',
'stats4',
'BiocGenerics',
'Biobase',
'IRanges',
'DBI',
'RSQLite',
'S4Vectors',
'stats',
'KEGGREST'],
'IRanges': ['R',
'methods',
'utils',
'stats',
'BiocGenerics',
'S4Vectors',
'stats4'],
'S4Vectors': ['R', 'methods', 'utils', 'stats', 'stats4', 'BiocGenerics'],
'KEGGREST': ['R', 'methods', 'httr', 'png', 'Biostrings'],
'genefilter': ['MatrixGenerics',
'AnnotationDbi',
'annotate',
'Biobase',
'graphics',
'methods',
'stats',
'survival',
'grDevices'],
'MatrixGenerics': ['matrixStats', 'methods'],
'annotate': ['R',
'AnnotationDbi',
'XML',
'Biobase',
'DBI',
'xtable',
'graphics',
'utils',
'stats',
'methods',
'BiocGenerics',
'httr'],
'limma': ['R', 'grDevices', 'graphics', 'stats', 'utils', 'methods'],
'multtest': ['R',
'methods',
'BiocGenerics',
'Biobase',
'survival',
'MASS',
'stats4'],
'a4Classif': ['a4Core',
'a4Preproc',
'methods',
'Biobase',
'ROCR',
'pamr',
'glmnet',
'varSelRF',
'utils',
'graphics',
'stats'],
'a4Reporting': ['methods', 'xtable']}
Note that some package managers use dependencies that are not found in their own repositories. This is the case with the 'xtable' package: although it is not in Bioconductor, it is a dependency of a Bioconductor package.
xtable_bioconductor = bioconductor_pm_scraper.fetch_package("xtable")
xtable_bioconductor
Specifically, this package is in CRAN
cran_pm = PackageManager(
data_sources=[ # List of data sources
CranScraper(),
]
)
cran_pm.fetch_package("xtable")
<olivia_finder.package.Package at 0x7f3c2a19d090>
To resolve this inconsistency, we can supply the package manager with the CRAN datasource as an auxiliary source, in which to search when data is not found in the main datasource
bioconductor_cran_pm = PackageManager(
# Name of the package manager
data_sources=[ # List of data sources
BioconductorScraper(),
CranScraper(),
]
)
bioconductor_cran_pm.fetch_package("xtable")
<olivia_finder.package.Package at 0x7f3c2a19c910>
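The ordered-fallback behavior described above (query each datasource in turn until one resolves the package) can be sketched independently of olivia_finder. The "datasources" here are plain dicts standing in for the real scraper objects:

```python
def fetch_with_fallback(sources, name):
    """Try each datasource in order; return the first hit, or None."""
    for source in sources:
        result = source.get(name)
        if result is not None:
            return result
    return None

# Stand-ins: 'xtable' is missing from the first source but present in the second,
# mirroring the Bioconductor/CRAN situation above
bioconductor = {"a4": {"name": "a4"}}
cran = {"xtable": {"name": "xtable"}}

print(fetch_with_fallback([bioconductor, cran], "xtable"))  # {'name': 'xtable'}
```

The order of the list matters: the main datasource is consulted first, and the auxiliary ones only fill gaps.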
In this way we can obtain the dependency network for a package recursively, now with access to packages and dependencies from the CRAN repository
a4_network3 = bioconductor_cran_pm.get_adjlist("a4")
a4_network3
{'a4': []}
Note that get_adjlist only uses data already loaded in the structure; since this combined manager has not been initialized, the resulting network contains little more than the root package. When the data is fetched instead (see the fetch_adjlist example further below), combining datasources yields a more complete network.
For this, the datasources must be compatible, as in the Bioconductor/CRAN case.
a4_network.keys() == a4_network2.keys()
False
print(len(a4_network.keys()))
print(len(a4_network2.keys()))
print(len(a4_network3.keys()))
42
18
1
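The size counts above can be made more informative by comparing the key sets directly: set operations show exactly which packages one network resolves that another does not. A sketch on small stand-in dicts (the package names are illustrative):

```python
# Stand-ins for adjacency lists built from different datasource configurations
combined = {"a4": [], "xtable": [], "Biobase": []}   # e.g. Bioconductor + CRAN
single = {"a4": [], "Biobase": []}                   # e.g. Bioconductor only

# dict keys behave as a set, so the difference lists the extra packages
only_in_combined = set(combined) - set(single)
print(only_in_combined)  # {'xtable'}
```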
Export the data
bioconductor_df = bioconductor_pm_loaded.export_dataframe(full_data=False)
#Export the dataframe to a csv file
bioconductor_df.to_csv("aux_data/bioconductor_adjlist_scraping.csv", index=False)
bioconductor_df
|   | name | dependency |
|---|---|---|
| 0 | ABSSeq | R |
| 1 | ABSSeq | methods |
| 2 | ABSSeq | locfit |
| 3 | ABSSeq | limma |
| 4 | AMOUNTAIN | R |
| ... | ... | ... |
| 28322 | zenith | reshape2 |
| 28323 | zenith | progress |
| 28324 | zenith | utils |
| 28325 | zenith | Rdpack |
| 28326 | zenith | stats |
28327 rows × 2 columns
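The exported adjacency-list CSV has the same two-column shape the CSVDataSource example at the top of this section consumes, so the export can be fed back in as a datasource. A minimal stdlib sketch of that round trip (column names follow the export above; an in-memory buffer stands in for the file):

```python
import csv
import io

rows = [("ABSSeq", "R"), ("ABSSeq", "limma"), ("zenith", "stats")]

# Write the adjacency list the way export_dataframe + to_csv does: name,dependency
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "dependency"])
writer.writerows(rows)

# Read it back into an adjacency dict, one list of dependencies per package
buf.seek(0)
adjlist = {}
for row in csv.DictReader(buf):
    adjlist.setdefault(row["name"], []).append(row["dependency"])

print(adjlist)  # {'ABSSeq': ['R', 'limma'], 'zenith': ['stats']}
```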
pypi_df = pypi_pm_loaded.export_dataframe(full_data=True)
pypi_df
|   | name | version | url | dependency | dependency_version | dependency_url |
|---|---|---|---|---|---|---|
| 0 | 0x-sra-client | 4.0.0 | https://pypi.org/project/0x-sra-client/ | urllib3 | 2.0.2 | https://pypi.org/project/urllib3/ |
| 1 | 0x-sra-client | 4.0.0 | https://pypi.org/project/0x-sra-client/ | six | 1.16.0 | https://pypi.org/project/six/ |
| 2 | 0x-sra-client | 4.0.0 | https://pypi.org/project/0x-sra-client/ | certifi | 2022.12.7 | https://pypi.org/project/certifi/ |
| 3 | 0x-sra-client | 4.0.0 | https://pypi.org/project/0x-sra-client/ | python | None | None |
| 4 | 0x-sra-client | 4.0.0 | https://pypi.org/project/0x-sra-client/ | 0x | 0.1 | https://pypi.org/project/0x/ |
| ... | ... | ... | ... | ... | ... | ... |
| 933950 | zyfra-check | 0.0.9 | https://pypi.org/project/zyfra-check/ | pytest | 7.3.1 | https://pypi.org/project/pytest/ |
| 933951 | zyfra-check | 0.0.9 | https://pypi.org/project/zyfra-check/ | jira | 3.5.0 | https://pypi.org/project/jira/ |
| 933952 | zyfra-check | 0.0.9 | https://pypi.org/project/zyfra-check/ | testit | None | None |
| 933953 | zython | 0.4.1 | https://pypi.org/project/zython/ | wheel | 0.40.0 | https://pypi.org/project/wheel/ |
| 933954 | zython | 0.4.1 | https://pypi.org/project/zython/ | minizinc | 0.9.0 | https://pypi.org/project/minizinc/ |
933955 rows × 6 columns
npm_df = npm_pm_loaded.export_dataframe(full_data=True)
npm_df
|   | name | version | url | dependency | dependency_version | dependency_url |
|---|---|---|---|---|---|---|
| 0 | --hoodmane-test-pyodide | 0.21.0 | https://www.npmjs.com/package/--hoodmane-test-... | base-64 | 1.0.0 | https://www.npmjs.com/package/base-64 |
| 1 | --hoodmane-test-pyodide | 0.21.0 | https://www.npmjs.com/package/--hoodmane-test-... | node-fetch | 3.3.1 | https://www.npmjs.com/package/node-fetch |
| 2 | --hoodmane-test-pyodide | 0.21.0 | https://www.npmjs.com/package/--hoodmane-test-... | ws | 8.13.0 | https://www.npmjs.com/package/ws |
| 3 | -lidonghui | 1.0.0 | https://www.npmjs.com/package/-lidonghui | axios | 1.4.0 | https://www.npmjs.com/package/axios |
| 4 | -lidonghui | 1.0.0 | https://www.npmjs.com/package/-lidonghui | commander | 10.0.1 | https://www.npmjs.com/package/commander |
| ... | ... | ... | ... | ... | ... | ... |
| 4855089 | zzzzz-first-module | 1.0.0 | https://www.npmjs.com/package/zzzzz-first-module | rxjs | 7.8.1 | https://www.npmjs.com/package/rxjs |
| 4855090 | zzzzz-first-module | 1.0.0 | https://www.npmjs.com/package/zzzzz-first-module | zone.js | 0.13.0 | https://www.npmjs.com/package/zone.js |
| 4855091 | zzzzzwszzzz | 1.0.0 | https://www.npmjs.com/package/zzzzzwszzzz | commander | 10.0.1 | https://www.npmjs.com/package/commander |
| 4855092 | zzzzzwszzzz | 1.0.0 | https://www.npmjs.com/package/zzzzzwszzzz | inquirer | 9.2.2 | https://www.npmjs.com/package/inquirer |
| 4855093 | zzzzzwszzzz | 1.0.0 | https://www.npmjs.com/package/zzzzzwszzzz | link | 1.5.1 | https://www.npmjs.com/package/link |
4855094 rows × 6 columns
Get the network graph
bioconductor_G = bioconductor_pm_loaded.get_network_graph()
bioconductor_G
<networkx.classes.digraph.DiGraph at 0x7f3c229451b0>
# Draw the graph
# ----------------
# Note: execution can take a while
import networkx as nx
import matplotlib.pyplot as plt
pos = nx.spring_layout(bioconductor_G)
plt.figure(figsize=(10, 10))
nx.draw_networkx_nodes(bioconductor_G, pos, node_size=10, node_color="blue")
nx.draw_networkx_edges(bioconductor_G, pos, alpha=0.4, edge_color="black", width=0.1)
plt.title("Bioconductor network graph", fontsize=15)
plt.show()
Explore the data
We can appreciate the difference explained earlier when we use a combined datasource
bioconductor_cran_pm = PackageManager(
data_sources=[BioconductorScraper(), CranScraper()]
)
a4_network_2 = bioconductor_cran_pm.fetch_adjlist("a4")
import json
print(json.dumps(a4_network_2, indent=4))
{
"a4": [
"a4Base",
"a4Preproc",
"a4Classif",
"a4Core",
"a4Reporting"
],
"a4Base": [
"a4Preproc",
"a4Core",
"methods",
"graphics",
"grid",
"Biobase",
"annaffy",
"mpm",
"genefilter",
"limma",
"multtest",
"glmnet",
"gplots"
],
"a4Preproc": [
"BiocGenerics",
"Biobase"
],
"BiocGenerics": [
"R",
"methods",
"utils",
"graphics",
"stats"
],
"R": [],
"methods": [],
"utils": [],
"graphics": [],
"stats": [],
"Biobase": [
"R",
"BiocGenerics",
"utils",
"methods"
],
"a4Core": [
"Biobase",
"glmnet",
"methods",
"stats"
],
"glmnet": [
"R",
"Matrix",
"methods",
"utils",
"foreach",
"shape",
"survival",
"Rcpp"
],
"Matrix": [
"R",
"methods",
"graphics",
"grid",
"lattice",
"stats",
"utils"
],
"foreach": [
"R",
"codetools",
"utils",
"iterators"
],
"shape": [
"R",
"stats",
"graphics",
"grDevices"
],
"survival": [
"R",
"graphics",
"Matrix",
"methods",
"splines",
"stats",
"utils"
],
"Rcpp": [
"methods",
"utils"
],
"grid": [],
"annaffy": [
"R",
"methods",
"Biobase",
"BiocManager",
"GO.db",
"AnnotationDbi",
"DBI"
],
"BiocManager": [
"utils"
],
"GO.db": [],
"AnnotationDbi": [
"R",
"methods",
"stats4",
"BiocGenerics",
"Biobase",
"IRanges",
"DBI",
"RSQLite",
"S4Vectors",
"stats",
"KEGGREST"
],
"stats4": [],
"IRanges": [
"R",
"methods",
"utils",
"stats",
"BiocGenerics",
"S4Vectors",
"stats4"
],
"DBI": [
"methods",
"R"
],
"RSQLite": [
"R",
"bit64",
"blob",
"DBI",
"memoise",
"methods",
"pkgconfig"
],
"S4Vectors": [
"R",
"methods",
"utils",
"stats",
"stats4",
"BiocGenerics"
],
"KEGGREST": [
"R",
"methods",
"httr",
"png",
"Biostrings"
],
"mpm": [
"R",
"MASS",
"KernSmooth"
],
"MASS": [
"R",
"grDevices",
"graphics",
"stats",
"utils",
"methods"
],
"grDevices": [],
"KernSmooth": [
"R",
"stats"
],
"genefilter": [
"MatrixGenerics",
"AnnotationDbi",
"annotate",
"Biobase",
"graphics",
"methods",
"stats",
"survival",
"grDevices"
],
"MatrixGenerics": [
"matrixStats",
"methods"
],
"matrixStats": [
"R"
],
"annotate": [
"R",
"AnnotationDbi",
"XML",
"Biobase",
"DBI",
"xtable",
"graphics",
"utils",
"stats",
"methods",
"BiocGenerics",
"httr"
],
"XML": [
"R",
"methods",
"utils"
],
"xtable": [
"R",
"stats",
"utils"
],
"httr": [
"R",
"curl",
"jsonlite",
"mime",
"openssl",
"R6"
],
"limma": [
"R",
"grDevices",
"graphics",
"stats",
"utils",
"methods"
],
"multtest": [
"R",
"methods",
"BiocGenerics",
"Biobase",
"survival",
"MASS",
"stats4"
],
"gplots": [
"R",
"gtools",
"stats",
"caTools",
"KernSmooth",
"methods"
],
"gtools": [
"methods",
"stats",
"utils"
],
"caTools": [
"R",
"bitops"
],
"bitops": [],
"a4Classif": [
"a4Core",
"a4Preproc",
"methods",
"Biobase",
"ROCR",
"pamr",
"glmnet",
"varSelRF",
"utils",
"graphics",
"stats"
],
"ROCR": [
"R",
"methods",
"graphics",
"grDevices",
"gplots",
"stats"
],
"pamr": [
"R",
"cluster",
"survival"
],
"cluster": [
"R",
"graphics",
"grDevices",
"stats",
"utils"
],
"varSelRF": [
"R",
"randomForest",
"parallel"
],
"randomForest": [
"R",
"stats"
],
"parallel": [],
"a4Reporting": [
"methods",
"xtable"
]
}
node_color="r") 578plt.title("org.apache.commons:commons-lang3 transitive network", fontsize=15) 579# add legend for red node 580red_patch = patches.Patch(color='red', label='org.apache.commons:commons-lang3') 581plt.legend(handles=[red_patch]) 582plt.show() 583``` 584 585 586 587 588 589 590 591**Obtaining updated data** 592 593 594```python 595a4_network2 = bioconductor_pm_loaded.get_adjlist("a4") 596a4_network2 597``` 598 599 600 601 602 {'a4': ['a4Base', 'a4Preproc', 'a4Classif', 'a4Core', 'a4Reporting'], 603 'a4Base': ['a4Preproc', 604 'a4Core', 605 'methods', 606 'graphics', 607 'grid', 608 'Biobase', 609 'annaffy', 610 'mpm', 611 'genefilter', 612 'limma', 613 'multtest', 614 'glmnet', 615 'gplots'], 616 'a4Preproc': ['BiocGenerics', 'Biobase'], 617 'BiocGenerics': ['R', 'methods', 'utils', 'graphics', 'stats'], 618 'Biobase': ['R', 'BiocGenerics', 'utils', 'methods'], 619 'a4Core': ['Biobase', 'glmnet', 'methods', 'stats'], 620 'annaffy': ['R', 621 'methods', 622 'Biobase', 623 'BiocManager', 624 'GO.db', 625 'AnnotationDbi', 626 'DBI'], 627 'AnnotationDbi': ['R', 628 'methods', 629 'utils', 630 'stats4', 631 'BiocGenerics', 632 'Biobase', 633 'IRanges', 634 'DBI', 635 'RSQLite', 636 'S4Vectors', 637 'stats', 638 'KEGGREST'], 639 'IRanges': ['R', 640 'methods', 641 'utils', 642 'stats', 643 'BiocGenerics', 644 'S4Vectors', 645 'stats4'], 646 'S4Vectors': ['R', 'methods', 'utils', 'stats', 'stats4', 'BiocGenerics'], 647 'KEGGREST': ['R', 'methods', 'httr', 'png', 'Biostrings'], 648 'genefilter': ['MatrixGenerics', 649 'AnnotationDbi', 650 'annotate', 651 'Biobase', 652 'graphics', 653 'methods', 654 'stats', 655 'survival', 656 'grDevices'], 657 'MatrixGenerics': ['matrixStats', 'methods'], 658 'annotate': ['R', 659 'AnnotationDbi', 660 'XML', 661 'Biobase', 662 'DBI', 663 'xtable', 664 'graphics', 665 'utils', 666 'stats', 667 'methods', 668 'BiocGenerics', 669 'httr'], 670 'limma': ['R', 'grDevices', 'graphics', 'stats', 'utils', 'methods'], 671 
     'multtest': ['R',
      'methods',
      'BiocGenerics',
      'Biobase',
      'survival',
      'MASS',
      'stats4'],
     'a4Classif': ['a4Core',
      'a4Preproc',
      'methods',
      'Biobase',
      'ROCR',
      'pamr',
      'glmnet',
      'varSelRF',
      'utils',
      'graphics',
      'stats'],
     'a4Reporting': ['methods', 'xtable']}

Note that some package managers use dependencies that are not found in their repositories. This is the case of the 'xtable' package: although it is not in Bioconductor, it is a dependency of a Bioconductor package.

```python
xtable_bioconductor = bioconductor_pm_scraper.fetch_package("xtable")
xtable_bioconductor
```

Specifically, this package lives in CRAN:

```python
cran_pm = PackageManager(
    data_sources=[  # List of data sources
        CranScraper(),
    ]
)

cran_pm.fetch_package("xtable")
```

    <olivia_finder.package.Package at 0x7f3c2a19d090>

To resolve this inconsistency, we can supply the package manager with the CRAN data source as an auxiliary data source, to be searched whenever a package is not found in the main data source:

```python
bioconductor_cran_pm = PackageManager(
    data_sources=[  # List of data sources
        BioconductorScraper(),
        CranScraper(),
    ]
)

bioconductor_cran_pm.fetch_package("xtable")
```

    <olivia_finder.package.Package at 0x7f3c2a19c910>

In this way we can obtain the dependency network of a package recursively, now with access to packages and dependencies that come from the CRAN repository.

```python
a4_network3 = bioconductor_cran_pm.get_adjlist("a4")
a4_network3
```

    {'a4': []}

As you can see, combining data sources lets us obtain a more complete network.

The data sources must be compatible with each other, as in the Bioconductor/CRAN case.

```python
764a4_network.keys() == a4_network2.keys() 765``` 766 767 768 769 770 False 771 772 773 774 775```python 776print(len(a4_network.keys())) 777print(len(a4_network2.keys())) 778print(len(a4_network3.keys())) 779``` 780 781 42 782 18 783 1 784 785 786## Export the data 787 788 789```python 790bioconductor_df = bioconductor_pm_loaded.export_dataframe(full_data=False) 791 792#Export the dataframe to a csv file 793bioconductor_df.to_csv("aux_data/bioconductor_adjlist_scraping.csv", index=False) 794bioconductor_df 795``` 796 797 798 799 800<div> 801<style scoped> 802 .dataframe tbody tr th:only-of-type { 803 vertical-align: middle; 804 } 805 806 .dataframe tbody tr th { 807 vertical-align: top; 808 } 809 810 .dataframe thead th { 811 text-align: right; 812 } 813</style> 814<table border="1" class="dataframe"> 815 <thead> 816 <tr style="text-align: right;"> 817 <th></th> 818 <th>name</th> 819 <th>dependency</th> 820 </tr> 821 </thead> 822 <tbody> 823 <tr> 824 <th>0</th> 825 <td>ABSSeq</td> 826 <td>R</td> 827 </tr> 828 <tr> 829 <th>1</th> 830 <td>ABSSeq</td> 831 <td>methods</td> 832 </tr> 833 <tr> 834 <th>2</th> 835 <td>ABSSeq</td> 836 <td>locfit</td> 837 </tr> 838 <tr> 839 <th>3</th> 840 <td>ABSSeq</td> 841 <td>limma</td> 842 </tr> 843 <tr> 844 <th>4</th> 845 <td>AMOUNTAIN</td> 846 <td>R</td> 847 </tr> 848 <tr> 849 <th>...</th> 850 <td>...</td> 851 <td>...</td> 852 </tr> 853 <tr> 854 <th>28322</th> 855 <td>zenith</td> 856 <td>reshape2</td> 857 </tr> 858 <tr> 859 <th>28323</th> 860 <td>zenith</td> 861 <td>progress</td> 862 </tr> 863 <tr> 864 <th>28324</th> 865 <td>zenith</td> 866 <td>utils</td> 867 </tr> 868 <tr> 869 <th>28325</th> 870 <td>zenith</td> 871 <td>Rdpack</td> 872 </tr> 873 <tr> 874 <th>28326</th> 875 <td>zenith</td> 876 <td>stats</td> 877 </tr> 878 </tbody> 879</table> 880<p>28327 rows × 2 columns</p> 881</div> 882 883 884 885 886```python 887pypi_df = pypi_pm_loaded.export_dataframe(full_data=True) 888pypi_df 889``` 890 891 892 893 894<div> 895<style scoped> 
896 .dataframe tbody tr th:only-of-type { 897 vertical-align: middle; 898 } 899 900 .dataframe tbody tr th { 901 vertical-align: top; 902 } 903 904 .dataframe thead th { 905 text-align: right; 906 } 907</style> 908<table border="1" class="dataframe"> 909 <thead> 910 <tr style="text-align: right;"> 911 <th></th> 912 <th>name</th> 913 <th>version</th> 914 <th>url</th> 915 <th>dependency</th> 916 <th>dependency_version</th> 917 <th>dependency_url</th> 918 </tr> 919 </thead> 920 <tbody> 921 <tr> 922 <th>0</th> 923 <td>0x-sra-client</td> 924 <td>4.0.0</td> 925 <td>https://pypi.org/project/0x-sra-client/</td> 926 <td>urllib3</td> 927 <td>2.0.2</td> 928 <td>https://pypi.org/project/urllib3/</td> 929 </tr> 930 <tr> 931 <th>1</th> 932 <td>0x-sra-client</td> 933 <td>4.0.0</td> 934 <td>https://pypi.org/project/0x-sra-client/</td> 935 <td>six</td> 936 <td>1.16.0</td> 937 <td>https://pypi.org/project/six/</td> 938 </tr> 939 <tr> 940 <th>2</th> 941 <td>0x-sra-client</td> 942 <td>4.0.0</td> 943 <td>https://pypi.org/project/0x-sra-client/</td> 944 <td>certifi</td> 945 <td>2022.12.7</td> 946 <td>https://pypi.org/project/certifi/</td> 947 </tr> 948 <tr> 949 <th>3</th> 950 <td>0x-sra-client</td> 951 <td>4.0.0</td> 952 <td>https://pypi.org/project/0x-sra-client/</td> 953 <td>python</td> 954 <td>None</td> 955 <td>None</td> 956 </tr> 957 <tr> 958 <th>4</th> 959 <td>0x-sra-client</td> 960 <td>4.0.0</td> 961 <td>https://pypi.org/project/0x-sra-client/</td> 962 <td>0x</td> 963 <td>0.1</td> 964 <td>https://pypi.org/project/0x/</td> 965 </tr> 966 <tr> 967 <th>...</th> 968 <td>...</td> 969 <td>...</td> 970 <td>...</td> 971 <td>...</td> 972 <td>...</td> 973 <td>...</td> 974 </tr> 975 <tr> 976 <th>933950</th> 977 <td>zyfra-check</td> 978 <td>0.0.9</td> 979 <td>https://pypi.org/project/zyfra-check/</td> 980 <td>pytest</td> 981 <td>7.3.1</td> 982 <td>https://pypi.org/project/pytest/</td> 983 </tr> 984 <tr> 985 <th>933951</th> 986 <td>zyfra-check</td> 987 <td>0.0.9</td> 988 
<td>https://pypi.org/project/zyfra-check/</td> 989 <td>jira</td> 990 <td>3.5.0</td> 991 <td>https://pypi.org/project/jira/</td> 992 </tr> 993 <tr> 994 <th>933952</th> 995 <td>zyfra-check</td> 996 <td>0.0.9</td> 997 <td>https://pypi.org/project/zyfra-check/</td> 998 <td>testit</td> 999 <td>None</td> 1000 <td>None</td> 1001 </tr> 1002 <tr> 1003 <th>933953</th> 1004 <td>zython</td> 1005 <td>0.4.1</td> 1006 <td>https://pypi.org/project/zython/</td> 1007 <td>wheel</td> 1008 <td>0.40.0</td> 1009 <td>https://pypi.org/project/wheel/</td> 1010 </tr> 1011 <tr> 1012 <th>933954</th> 1013 <td>zython</td> 1014 <td>0.4.1</td> 1015 <td>https://pypi.org/project/zython/</td> 1016 <td>minizinc</td> 1017 <td>0.9.0</td> 1018 <td>https://pypi.org/project/minizinc/</td> 1019 </tr> 1020 </tbody> 1021</table> 1022<p>933955 rows × 6 columns</p> 1023</div> 1024 1025 1026 1027 1028```python 1029npm_df = npm_pm_loaded.export_dataframe(full_data=True) 1030npm_df 1031``` 1032 1033 1034 1035 1036<div> 1037<style scoped> 1038 .dataframe tbody tr th:only-of-type { 1039 vertical-align: middle; 1040 } 1041 1042 .dataframe tbody tr th { 1043 vertical-align: top; 1044 } 1045 1046 .dataframe thead th { 1047 text-align: right; 1048 } 1049</style> 1050<table border="1" class="dataframe"> 1051 <thead> 1052 <tr style="text-align: right;"> 1053 <th></th> 1054 <th>name</th> 1055 <th>version</th> 1056 <th>url</th> 1057 <th>dependency</th> 1058 <th>dependency_version</th> 1059 <th>dependency_url</th> 1060 </tr> 1061 </thead> 1062 <tbody> 1063 <tr> 1064 <th>0</th> 1065 <td>--hoodmane-test-pyodide</td> 1066 <td>0.21.0</td> 1067 <td>https://www.npmjs.com/package/--hoodmane-test-...</td> 1068 <td>base-64</td> 1069 <td>1.0.0</td> 1070 <td>https://www.npmjs.com/package/base-64</td> 1071 </tr> 1072 <tr> 1073 <th>1</th> 1074 <td>--hoodmane-test-pyodide</td> 1075 <td>0.21.0</td> 1076 <td>https://www.npmjs.com/package/--hoodmane-test-...</td> 1077 <td>node-fetch</td> 1078 <td>3.3.1</td> 1079 
<td>https://www.npmjs.com/package/node-fetch</td> 1080 </tr> 1081 <tr> 1082 <th>2</th> 1083 <td>--hoodmane-test-pyodide</td> 1084 <td>0.21.0</td> 1085 <td>https://www.npmjs.com/package/--hoodmane-test-...</td> 1086 <td>ws</td> 1087 <td>8.13.0</td> 1088 <td>https://www.npmjs.com/package/ws</td> 1089 </tr> 1090 <tr> 1091 <th>3</th> 1092 <td>-lidonghui</td> 1093 <td>1.0.0</td> 1094 <td>https://www.npmjs.com/package/-lidonghui</td> 1095 <td>axios</td> 1096 <td>1.4.0</td> 1097 <td>https://www.npmjs.com/package/axios</td> 1098 </tr> 1099 <tr> 1100 <th>4</th> 1101 <td>-lidonghui</td> 1102 <td>1.0.0</td> 1103 <td>https://www.npmjs.com/package/-lidonghui</td> 1104 <td>commander</td> 1105 <td>10.0.1</td> 1106 <td>https://www.npmjs.com/package/commander</td> 1107 </tr> 1108 <tr> 1109 <th>...</th> 1110 <td>...</td> 1111 <td>...</td> 1112 <td>...</td> 1113 <td>...</td> 1114 <td>...</td> 1115 <td>...</td> 1116 </tr> 1117 <tr> 1118 <th>4855089</th> 1119 <td>zzzzz-first-module</td> 1120 <td>1.0.0</td> 1121 <td>https://www.npmjs.com/package/zzzzz-first-module</td> 1122 <td>rxjs</td> 1123 <td>7.8.1</td> 1124 <td>https://www.npmjs.com/package/rxjs</td> 1125 </tr> 1126 <tr> 1127 <th>4855090</th> 1128 <td>zzzzz-first-module</td> 1129 <td>1.0.0</td> 1130 <td>https://www.npmjs.com/package/zzzzz-first-module</td> 1131 <td>zone.js</td> 1132 <td>0.13.0</td> 1133 <td>https://www.npmjs.com/package/zone.js</td> 1134 </tr> 1135 <tr> 1136 <th>4855091</th> 1137 <td>zzzzzwszzzz</td> 1138 <td>1.0.0</td> 1139 <td>https://www.npmjs.com/package/zzzzzwszzzz</td> 1140 <td>commander</td> 1141 <td>10.0.1</td> 1142 <td>https://www.npmjs.com/package/commander</td> 1143 </tr> 1144 <tr> 1145 <th>4855092</th> 1146 <td>zzzzzwszzzz</td> 1147 <td>1.0.0</td> 1148 <td>https://www.npmjs.com/package/zzzzzwszzzz</td> 1149 <td>inquirer</td> 1150 <td>9.2.2</td> 1151 <td>https://www.npmjs.com/package/inquirer</td> 1152 </tr> 1153 <tr> 1154 <th>4855093</th> 1155 <td>zzzzzwszzzz</td> 1156 <td>1.0.0</td> 1157 
<td>https://www.npmjs.com/package/zzzzzwszzzz</td> 1158 <td>link</td> 1159 <td>1.5.1</td> 1160 <td>https://www.npmjs.com/package/link</td> 1161 </tr> 1162 </tbody> 1163</table> 1164<p>4855094 rows × 6 columns</p> 1165</div> 1166 1167 1168 1169**Get Network graph** 1170 1171 1172```python 1173bioconductor_G = bioconductor_pm_loaded.get_network_graph() 1174bioconductor_G 1175``` 1176 1177 1178 1179 1180 <networkx.classes.digraph.DiGraph at 0x7f3c229451b0> 1181 1182 1183 1184 1185```python 1186# Draw the graph 1187# ---------------- 1188# Note: 1189# - Execution time can take a bit 1190 1191pos = nx.spring_layout(bioconductor_G) 1192plt.figure(figsize=(10, 10)) 1193nx.draw_networkx_nodes(bioconductor_G, pos, node_size=10, node_color="blue") 1194nx.draw_networkx_edges(bioconductor_G, pos, alpha=0.4, edge_color="black", width=0.1) 1195plt.title("Bioconductor network graph", fontsize=15) 1196plt.show() 1197``` 1198 1199 1200 1201## Explore the data 1202 1203 1204We can appreciate the difference, as we explained before if we use a combined datasource 1205 1206 1207```python 1208bioconductor_cran_pm = PackageManager( 1209 data_sources=[BioconductorScraper(), CranScraper()] 1210) 1211 1212a4_network_2 = bioconductor_cran_pm.fetch_adjlist("a4") 1213``` 1214 1215 1216```python 1217import json 1218print(json.dumps(a4_network_2, indent=4)) 1219``` 1220 1221 { 1222 "a4": [ 1223 "a4Base", 1224 "a4Preproc", 1225 "a4Classif", 1226 "a4Core", 1227 "a4Reporting" 1228 ], 1229 "a4Base": [ 1230 "a4Preproc", 1231 "a4Core", 1232 "methods", 1233 "graphics", 1234 "grid", 1235 "Biobase", 1236 "annaffy", 1237 "mpm", 1238 "genefilter", 1239 "limma", 1240 "multtest", 1241 "glmnet", 1242 "gplots" 1243 ], 1244 "a4Preproc": [ 1245 "BiocGenerics", 1246 "Biobase" 1247 ], 1248 "BiocGenerics": [ 1249 "R", 1250 "methods", 1251 "utils", 1252 "graphics", 1253 "stats" 1254 ], 1255 "R": [], 1256 "methods": [], 1257 "utils": [], 1258 "graphics": [], 1259 "stats": [], 1260 "Biobase": [ 1261 "R", 1262 
"BiocGenerics", 1263 "utils", 1264 "methods" 1265 ], 1266 "a4Core": [ 1267 "Biobase", 1268 "glmnet", 1269 "methods", 1270 "stats" 1271 ], 1272 "glmnet": [ 1273 "R", 1274 "Matrix", 1275 "methods", 1276 "utils", 1277 "foreach", 1278 "shape", 1279 "survival", 1280 "Rcpp" 1281 ], 1282 "Matrix": [ 1283 "R", 1284 "methods", 1285 "graphics", 1286 "grid", 1287 "lattice", 1288 "stats", 1289 "utils" 1290 ], 1291 "foreach": [ 1292 "R", 1293 "codetools", 1294 "utils", 1295 "iterators" 1296 ], 1297 "shape": [ 1298 "R", 1299 "stats", 1300 "graphics", 1301 "grDevices" 1302 ], 1303 "survival": [ 1304 "R", 1305 "graphics", 1306 "Matrix", 1307 "methods", 1308 "splines", 1309 "stats", 1310 "utils" 1311 ], 1312 "Rcpp": [ 1313 "methods", 1314 "utils" 1315 ], 1316 "grid": [], 1317 "annaffy": [ 1318 "R", 1319 "methods", 1320 "Biobase", 1321 "BiocManager", 1322 "GO.db", 1323 "AnnotationDbi", 1324 "DBI" 1325 ], 1326 "BiocManager": [ 1327 "utils" 1328 ], 1329 "GO.db": [], 1330 "AnnotationDbi": [ 1331 "R", 1332 "methods", 1333 "stats4", 1334 "BiocGenerics", 1335 "Biobase", 1336 "IRanges", 1337 "DBI", 1338 "RSQLite", 1339 "S4Vectors", 1340 "stats", 1341 "KEGGREST" 1342 ], 1343 "stats4": [], 1344 "IRanges": [ 1345 "R", 1346 "methods", 1347 "utils", 1348 "stats", 1349 "BiocGenerics", 1350 "S4Vectors", 1351 "stats4" 1352 ], 1353 "DBI": [ 1354 "methods", 1355 "R" 1356 ], 1357 "RSQLite": [ 1358 "R", 1359 "bit64", 1360 "blob", 1361 "DBI", 1362 "memoise", 1363 "methods", 1364 "pkgconfig" 1365 ], 1366 "S4Vectors": [ 1367 "R", 1368 "methods", 1369 "utils", 1370 "stats", 1371 "stats4", 1372 "BiocGenerics" 1373 ], 1374 "KEGGREST": [ 1375 "R", 1376 "methods", 1377 "httr", 1378 "png", 1379 "Biostrings" 1380 ], 1381 "mpm": [ 1382 "R", 1383 "MASS", 1384 "KernSmooth" 1385 ], 1386 "MASS": [ 1387 "R", 1388 "grDevices", 1389 "graphics", 1390 "stats", 1391 "utils", 1392 "methods" 1393 ], 1394 "grDevices": [], 1395 "KernSmooth": [ 1396 "R", 1397 "stats" 1398 ], 1399 "genefilter": [ 1400 "MatrixGenerics", 1401 
"AnnotationDbi", 1402 "annotate", 1403 "Biobase", 1404 "graphics", 1405 "methods", 1406 "stats", 1407 "survival", 1408 "grDevices" 1409 ], 1410 "MatrixGenerics": [ 1411 "matrixStats", 1412 "methods" 1413 ], 1414 "matrixStats": [ 1415 "R" 1416 ], 1417 "annotate": [ 1418 "R", 1419 "AnnotationDbi", 1420 "XML", 1421 "Biobase", 1422 "DBI", 1423 "xtable", 1424 "graphics", 1425 "utils", 1426 "stats", 1427 "methods", 1428 "BiocGenerics", 1429 "httr" 1430 ], 1431 "XML": [ 1432 "R", 1433 "methods", 1434 "utils" 1435 ], 1436 "xtable": [ 1437 "R", 1438 "stats", 1439 "utils" 1440 ], 1441 "httr": [ 1442 "R", 1443 "curl", 1444 "jsonlite", 1445 "mime", 1446 "openssl", 1447 "R6" 1448 ], 1449 "limma": [ 1450 "R", 1451 "grDevices", 1452 "graphics", 1453 "stats", 1454 "utils", 1455 "methods" 1456 ], 1457 "multtest": [ 1458 "R", 1459 "methods", 1460 "BiocGenerics", 1461 "Biobase", 1462 "survival", 1463 "MASS", 1464 "stats4" 1465 ], 1466 "gplots": [ 1467 "R", 1468 "gtools", 1469 "stats", 1470 "caTools", 1471 "KernSmooth", 1472 "methods" 1473 ], 1474 "gtools": [ 1475 "methods", 1476 "stats", 1477 "utils" 1478 ], 1479 "caTools": [ 1480 "R", 1481 "bitops" 1482 ], 1483 "bitops": [], 1484 "a4Classif": [ 1485 "a4Core", 1486 "a4Preproc", 1487 "methods", 1488 "Biobase", 1489 "ROCR", 1490 "pamr", 1491 "glmnet", 1492 "varSelRF", 1493 "utils", 1494 "graphics", 1495 "stats" 1496 ], 1497 "ROCR": [ 1498 "R", 1499 "methods", 1500 "graphics", 1501 "grDevices", 1502 "gplots", 1503 "stats" 1504 ], 1505 "pamr": [ 1506 "R", 1507 "cluster", 1508 "survival" 1509 ], 1510 "cluster": [ 1511 "R", 1512 "graphics", 1513 "grDevices", 1514 "stats", 1515 "utils" 1516 ], 1517 "varSelRF": [ 1518 "R", 1519 "randomForest", 1520 "parallel" 1521 ], 1522 "randomForest": [ 1523 "R", 1524 "stats" 1525 ], 1526 "parallel": [], 1527 "a4Reporting": [ 1528 "methods", 1529 "xtable" 1530 ] 1531 } 1532 1533 1534''' 1535 1536from __future__ import annotations 1537from typing import Dict, List, Optional, Union 1538import pickle 
import tqdm
import pandas as pd
import networkx as nx

from .utilities.config import Configuration
from .myrequests.request_handler import RequestHandler
from .utilities.logger import MyLogger
from .data_source.data_source import DataSource
from .data_source.scraper_ds import ScraperDataSource
from .data_source.csv_ds import CSVDataSource
from .data_source.librariesio_ds import LibrariesioDataSource
from .data_source.repository_scrapers.github import GithubScraper
from .package import Package


class PackageManager:
    '''
    Class that represents a package manager, which provides a way to obtain packages from a data source and store them
    in a dictionary
    '''

    def __init__(self, data_sources: Optional[List[DataSource]] = None):
        '''
        Constructor of the PackageManager class

        Parameters
        ----------
        data_sources : Optional[List[DataSource]]
            List of data sources used to obtain the packages; must not be None or empty

        Raises
        ------
        ValueError
            If the data_sources parameter is None or empty

        Examples
        --------
        >>> package_manager = PackageManager([CSVDataSource("path/to/file.csv", dependent_field="name", dependency_field="dependency")])
        '''

        if not data_sources:
            raise ValueError("Data sources cannot be empty")

        self.data_sources: List[DataSource] = data_sources
        self.packages: Dict[str, Package] = {}

        # Init the logger for the package manager
        self.logger = MyLogger.get_logger('logger_packagemanager')

    def save(self, path: str):
        '''
        Saves the package manager to a file; by convention it has the extension .olvpm for easy identification
        as an Olivia package manager file

        Parameters
        ----------
        path : str
            Path of the file where the package manager will be saved
        '''

        # Remove redundant objects (request handlers are rebuilt on load)
        for data_source in self.data_sources:
            if isinstance(data_source, ScraperDataSource):
                try:
                    del data_source.request_handler
                except AttributeError:
                    pass

        try:
            # Use pickle to save the package manager
            with open(path, "wb") as f:
                pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)

        except Exception as e:
            raise PackageManagerSaveError(f"Error saving package manager: {e}") from e

    @classmethod
    def load_from_persistence(cls, path: str):
        '''
        Load the package manager from a file; the file must have been created with the save method.
        By convention it has the extension .olvpm

        Parameters
        ----------
        path : str
            Path of the file from which to load the package manager

        Returns
        -------
        Union[PackageManager, None]
            PackageManager object if the file exists and is valid, None otherwise
        '''

        # Init the logger for the package manager
        logger = MyLogger.get_logger("logger_packagemanager")

        # Try to load the package manager from the file
        try:
            # Use pickle to load the package manager
            logger.info(f"Loading package manager from {path}")
            with open(path, "rb") as f:
                obj = pickle.load(f)
            logger.info("Package manager loaded")
        except Exception:
            logger.error(f"Error loading package manager from {path}")
            return None

        if not isinstance(obj, PackageManager):
            return None

        # Set the request handler for the scraper data sources
        for data_source in obj.data_sources:
            if isinstance(data_source, ScraperDataSource):
                data_source.request_handler = RequestHandler()
                # Set the logger for the scraper data source
                data_source.logger = MyLogger.get_logger("logger_datasource")

        obj.logger = logger

        return obj

    @classmethod
    def load_from_csv(
        cls,
        csv_path: str,
        dependent_field: Optional[str] = None,
        dependency_field:
Optional[str] = None,
        version_field: Optional[str] = None,
        dependency_version_field: Optional[str] = None,
        url_field: Optional[str] = None,
        default_format: Optional[bool] = False,
    ) -> PackageManager:
        '''
        Load a csv file into a PackageManager object

        Parameters
        ----------
        csv_path : str
            Path of the csv file to load
        dependent_field : str, optional
            Name of the dependent field, by default None
        dependency_field : str, optional
            Name of the dependency field, by default None
        version_field : str, optional
            Name of the version field, by default None
        dependency_version_field : str, optional
            Name of the dependency version field, by default None
        url_field : str, optional
            Name of the url field, by default None
        default_format : bool, optional
            If True, the csv has the structure of full_adjlist.csv, by default False

        Examples
        --------
        >>> pm = PackageManager.load_from_csv(
            "full_adjlist.csv",
            dependent_field="dependent",
            dependency_field="dependency",
            version_field="version",
            dependency_version_field="dependency_version",
            url_field="url"
        )
        >>> pm = PackageManager.load_from_csv("full_adjlist.csv", default_format=True)
        '''

        # Init the logger for the package manager
        logger = MyLogger.get_logger('logger_packagemanager')

        try:
            logger.info(f"Loading csv file from {csv_path}")
            data = pd.read_csv(csv_path)
        except Exception as e:
            logger.error(f"Error loading csv file: {e}")
            raise PackageManagerLoadError(f"Error loading csv file: {e}") from e

        csv_fields = []

        if default_format:
            # If the csv has the structure of full_adjlist.csv, we use the default fields
            dependent_field = 'name'
            dependency_field = 'dependency'
            version_field = 'version'
            dependency_version_field =
'dependency_version'
            url_field = 'url'
            csv_fields = [dependent_field, dependency_field,
                          version_field, dependency_version_field, url_field]
        else:
            if dependent_field is None or dependency_field is None:
                raise PackageManagerLoadError(
                    "Dependent and dependency fields must be specified")

            csv_fields = [dependent_field, dependency_field]
            # If the optional fields are specified, we add them to the list
            if version_field is not None:
                csv_fields.append(version_field)
            if dependency_version_field is not None:
                csv_fields.append(dependency_version_field)
            if url_field is not None:
                csv_fields.append(url_field)

        # If the csv does not have the specified fields, we raise an error
        if any(col not in data.columns for col in csv_fields):
            logger.error("Invalid csv format")
            raise PackageManagerLoadError("Invalid csv format")

        # We create the data source
        data_source = CSVDataSource(
            file_path=csv_path,
            dependent_field=dependent_field,
            dependency_field=dependency_field,
            dependent_version_field=version_field,
            dependency_version_field=dependency_version_field,
            dependent_url_field=url_field
        )

        obj = cls([data_source])

        # Add the logger to the package manager
        obj.logger = logger

        # Return the package manager
        return obj

    def initialize(
        self,
        package_names: Optional[List[str]] = None,
        show_progress: Optional[bool] = False,
        chunk_size: Optional[int] = 10000):
        '''
        Initializes the package manager by loading the packages from the data source

        Parameters
        ----------
        package_names : List[str]
            List of package names to load; if None, all the packages will be loaded
        show_progress : bool
            If True, a progress bar will be shown
        chunk_size : int
            Size of the chunks used to load the packages; chunking is done to avoid memory errors

        .. warning::
            For large package lists, this method can take a long time to complete
        '''

        # Get package names from the data sources if needed
        if package_names is None:
            for data_source in self.data_sources:
                try:
                    package_names = data_source.obtain_package_names()
                    break
                except NotImplementedError as e:
                    self.logger.debug(f"Data source {data_source} does not implement the obtain_package_names method: {e}")
                    continue
                except Exception as e:
                    self.logger.error(f"Error while obtaining package names from data source: {e}")
                    continue

        # Check if the package names are valid
        if package_names is None or not isinstance(package_names, list):
            raise ValueError("No valid package names found")

        # Instantiate the progress bar if needed
        progress_bar = tqdm.tqdm(
            total=len(package_names),
            colour="green",
            desc="Loading packages",
            unit="packages",
        ) if show_progress else None

        # Create a chunked list of package names
        # This is done to avoid memory errors
        package_names_chunked = [package_names[i:i + chunk_size] for i in range(0, len(package_names), chunk_size)]

        for chunk in package_names_chunked:
            # Obtain the packages data from the data source and store them
            self.fetch_packages(
                package_names=chunk,
                progress_bar=progress_bar,
                extend=True
            )

        # Close the progress bar if needed
        if progress_bar is not None:
            progress_bar.close()

    def fetch_package(self, package_name: str) -> Union[Package, None]:
        '''
        Builds a Package object using the data sources in order until one of them returns a valid package

        Parameters
        ----------
        package_name : str
            Name of the package

        Returns
        -------
        Union[Package, None]
            Package object if the package exists, None otherwise

        Examples
        --------
        >>> package =
package_manager.fetch_package("package_name")
        >>> package
        <Package: package_name>
        '''
        # Obtain the package data from the data sources in order
        package_data = None
        for data_source in self.data_sources:

            if isinstance(data_source, (GithubScraper, CSVDataSource, ScraperDataSource, LibrariesioDataSource)):
                package_data = data_source.obtain_package_data(package_name)
            else:
                # Fall back to the locally stored package, guarding against missing entries
                package = self.get_package(package_name)
                package_data = package.to_dict() if package is not None else None

            if package_data is not None:
                self.logger.debug(f"Package {package_name} found using {data_source.__class__.__name__}")
                break
            else:
                self.logger.debug(f"Package {package_name} not found using {data_source.__class__.__name__}")

        # Return the package if it exists
        return None if package_data is None else Package.load(package_data)

    def fetch_packages(
        self,
        package_names: List[str],
        progress_bar: Optional[tqdm.tqdm] = None,
        extend: bool = False
    ) -> List[Package]:
        '''
        Builds a list of Package objects using the data sources in order until one of them returns a valid package

        Parameters
        ----------
        package_names : List[str]
            List of package names
        progress_bar : tqdm.tqdm, optional
            Progress bar used to show the progress of the operation
        extend : bool
            If True, the packages will be added to the existing ones; otherwise, the existing ones will be replaced

        Returns
        -------
        List[Package]
            List of Package objects

        Examples
        --------
        >>> packages = package_manager.fetch_packages(["package_name_1", "package_name_2"])
        >>> packages
        [<Package: package_name_1>, <Package: package_name_2>]
        '''

        # Check if the package names are valid
        if not isinstance(package_names, list):
            raise ValueError("Package names must be a list")

        preferred_data_source = self.data_sources[0]

        # Return list
        packages = []

        #
if datasource is instance of ScraperDataSource use the obtain_packages_data method for parallelization 1909 if isinstance(preferred_data_source, ScraperDataSource): 1910 1911 packages_data = [] 1912 data_found, not_found = preferred_data_source.obtain_packages_data( 1913 package_names=package_names, 1914 progress_bar=progress_bar # type: ignore 1915 ) 1916 packages_data.extend(data_found) 1917 # pending_packages = not_found 1918 self.logger.info(f"Packages found: {len(data_found)}, Packages not found: {len(not_found)}") 1919 packages = [Package.load(package_data) for package_data in packages_data] 1920 1921 # if not use the obtain_package_data method for sequential processing using the data_sources of the list 1922 else: 1923 1924 while len(package_names) > 0: 1925 1926 package_name = package_names[0] 1927 package_data = self.fetch_package(package_name) 1928 if package_data is not None: 1929 packages.append(package_data) 1930 1931 # Remove the package from the pending packages 1932 del package_names[0] 1933 1934 if progress_bar is not None: 1935 progress_bar.update(1) 1936 1937 self.logger.info(f"Total packages found: {len(packages)}") 1938 1939 # update the self.packages attribute overwriting the packages with the same name 1940 # but conserving the other packages 1941 if extend: 1942 self.logger.info("Extending data source with obtained packages") 1943 for package in packages: 1944 self.packages[package.name] = package 1945 1946 return packages 1947 1948 def get_package(self, package_name: str) -> Union[Package, None]: 1949 ''' 1950 Obtain a package from the package manager 1951 1952 Parameters 1953 ---------- 1954 package_name : str 1955 Name of the package 1956 1957 Returns 1958 ------- 1959 Union[Package, None] 1960 Package object if the package exists, None otherwise 1961 1962 Examples 1963 -------- 1964 >>> package = package_manager.get_package("package_name") 1965 >>> print(package.name) 1966 ''' 1967 return self.packages.get(package_name, None) 1968 1969 
def get_packages(self) -> List[Package]: 1970 ''' 1971 Obtain the list of packages of the package manager 1972 1973 Returns 1974 ------- 1975 List[Package] 1976 List of packages of the package manager 1977 1978 Examples 1979 -------- 1980 >>> package_list = package_manager.get_package_list() 1981 ''' 1982 return list(self.packages.values()) 1983 1984 def package_names(self) -> List[str]: 1985 ''' 1986 Obtain the list of package names of the package manager 1987 1988 Returns 1989 ------- 1990 List[str] 1991 List of package names of the package manager 1992 1993 Examples 1994 -------- 1995 >>> package_names = package_manager.get_package_names() 1996 ''' 1997 return list(self.packages.keys()) 1998 1999 def fetch_package_names(self) -> List[str]: 2000 ''' 2001 Obtain the list of package names of the package manager 2002 2003 Returns 2004 ------- 2005 List[str] 2006 List of package names of the package manager 2007 2008 Examples 2009 -------- 2010 >>> package_names = package_manager.obtain_package_names() 2011 ''' 2012 2013 return self.data_sources[0].obtain_package_names() 2014 2015 def export_dataframe(self, full_data = False) -> pd.DataFrame: 2016 ''' 2017 Convert the object to a adjacency list, where each row represents a dependency 2018 If a package has'nt dependencies, it will appear in the list with dependency field empty 2019 2020 Parameters 2021 ---------- 2022 full_data : bool, optional 2023 If True, the adjacency list will contain the version and url of the packages, by default False 2024 2025 Returns 2026 ------- 2027 pd.DataFrame 2028 Dependency network as an adjacency list 2029 2030 Examples 2031 -------- 2032 >>> adj_list = package_manager.export_adjlist() 2033 >>> print(adj_list) 2034 [name, dependency] 2035 ''' 2036 2037 if not self.packages: 2038 self.logger.debug("The package manager is empty") 2039 return pd.DataFrame() 2040 2041 2042 rows = [] 2043 2044 if full_data: 2045 for package_name in self.packages.keys(): 2046 package = 
self.get_package(package_name) 2047 2048 2049 for dependency in package.dependencies: 2050 2051 try: 2052 dependency_full = self.get_package(dependency.name) 2053 rows.append( 2054 [package.name, package.version, package.url, dependency_full.name, dependency_full.version, dependency_full.url] 2055 ) 2056 except Exception: 2057 if dependency.name is not None: 2058 rows.append( 2059 [package.name, package.version, package.url, dependency.name, None, None] 2060 ) 2061 2062 2063 return pd.DataFrame(rows, columns=['name', 'version', 'url', 'dependency', 'dependency_version', 'dependency_url']) 2064 else: 2065 for package_name in self.packages.keys(): 2066 package = self.get_package(package_name) 2067 rows.extend( 2068 [package.name, dependency.name] 2069 for dependency in package.dependencies 2070 ) 2071 return pd.DataFrame(rows, columns=['name', 'dependency']) 2072 2073 def get_adjlist(self, package_name: str, adjlist: Optional[Dict] = None, deep_level: int = 5) -> Dict[str, List[str]]: 2074 """ 2075 Generates the dependency network of a package from the data source. 
2076 2077 Parameters 2078 ---------- 2079 package_name : str 2080 The name of the package to generate the dependency network 2081 adjlist : Optional[Dict], optional 2082 The dependency network of the package, by default None 2083 deep_level : int, optional 2084 The deep level of the dependency network, by default 5 2085 2086 Returns 2087 ------- 2088 Dict[str, List[str]] 2089 The dependency network of the package 2090 """ 2091 2092 # If the deep level is 0, we return the dependency network (Stop condition) 2093 if deep_level == 0: 2094 return adjlist 2095 2096 # If the dependency network is not specified, we create it (Initial case) 2097 if adjlist is None: 2098 adjlist = {} 2099 2100 # If the package is already in the dependency network, we return it (Stop condition) 2101 if package_name in adjlist: 2102 return adjlist 2103 2104 # Use the data of the package manager 2105 current_package = self.get_package(package_name) 2106 dependencies = current_package.get_dependencies_names() if current_package is not None else [] 2107 2108 # Get the dependencies of the package and add it to the dependency network if it is not already in it 2109 adjlist[package_name] = dependencies 2110 2111 # Append the dependencies of the package to the dependency network 2112 for dependency_name in dependencies: 2113 2114 if (dependency_name not in adjlist) and (self.get_package(dependency_name) is not None): 2115 2116 adjlist = self.get_adjlist( 2117 package_name = dependency_name, 2118 adjlist = adjlist, 2119 deep_level = deep_level - 1, 2120 ) 2121 2122 return adjlist 2123 2124 def fetch_adjlist(self, package_name: str, deep_level: int = 5, adjlist: dict = None) -> Dict[str, List[str]]: 2125 """ 2126 Generates the dependency network of a package from the data source. 
2127 2128 Parameters 2129 ---------- 2130 package_name : str 2131 The name of the package to generate the dependency network 2132 deep_level : int, optional 2133 The deep level of the dependency network, by default 5 2134 dependency_network : dict, optional 2135 The dependency network of the package 2136 2137 Returns 2138 ------- 2139 Dict[str, List[str]] 2140 The dependency network of the package 2141 """ 2142 2143 if adjlist is None: 2144 adjlist = {} 2145 2146 # If the deep level is 0, we return the adjacency list (Stop condition) 2147 if deep_level == 0 or package_name in adjlist: 2148 return adjlist 2149 2150 dependencies = [] 2151 try: 2152 current_package = self.fetch_package(package_name) 2153 dependencies = current_package.get_dependencies_names() 2154 2155 except Exception as e: 2156 self.logger.debug(f"Package {package_name} not found: {e}") 2157 2158 # Add the package to the adjacency list if it is not already in it 2159 adjlist[package_name] = dependencies 2160 2161 # Append the dependencies of the package to the adjacency list if they are not already in it 2162 for dependency_name in dependencies: 2163 if dependency_name not in adjlist: 2164 try: 2165 adjlist = self.fetch_adjlist( 2166 package_name=dependency_name, # The name of the dependency 2167 deep_level=deep_level - 1, # The deep level is reduced by 1 2168 adjlist=adjlist # The global adjacency list 2169 ) 2170 except Exception: 2171 self.logger.debug( 2172 f"The package {dependency_name}, as dependency of {package_name} does not exist in the data source" 2173 ) 2174 2175 return adjlist 2176 2177 def __add_chunk(self, 2178 df, G, 2179 filter_field=None, 2180 filter_value=None 2181 ): 2182 2183 filtered = df[df[filter_field] == filter_value] if filter_field else df 2184 links = list(zip(filtered["name"], filtered["dependency"])) 2185 G.add_edges_from(links) 2186 return G 2187 2188 def get_network_graph( 2189 self, chunk_size = int(1e6), 2190 source_field = "dependency", target_field = "name", 
2191 filter_field=None, filter_value=None) -> nx.DiGraph: 2192 """ 2193 Builds a dependency network graph from a dataframe of dependencies. 2194 The dataframe must have two columns: dependent and dependency. 2195 2196 Parameters 2197 ---------- 2198 chunk_size : int 2199 Number of rows to process at a time 2200 source_field : str 2201 Name of the column containing the source node 2202 target_field : str 2203 Name of the column containing the target node 2204 filter_field : str, optional 2205 Name of the column to filter on, by default None 2206 filter_value : str, optional 2207 Value to filter on, by default None 2208 2209 Returns 2210 ------- 2211 nx.DiGraph 2212 Directed graph of dependencies 2213 """ 2214 2215 2216 # If the default dtasource is a CSV_Datasource, we use custom implementation 2217 defaul_datasource = self.__get_default_datasource() 2218 if isinstance(defaul_datasource, CSVDataSource): 2219 return nx.from_pandas_edgelist( 2220 defaul_datasource.data, source=source_field, 2221 target=target_field, create_using=nx.DiGraph() 2222 ) 2223 2224 # If the default datasource is not a CSV_Datasource, we use the default implementation 2225 df = self.export_dataframe() 2226 try: 2227 # New NetworkX directed Graph 2228 G = nx.DiGraph() 2229 2230 for i in range(0, len(df), chunk_size): 2231 chunk = df.iloc[i:i+chunk_size] 2232 # Add dependencies from chunk to G 2233 G = self.__add_chunk( 2234 chunk, 2235 G, 2236 filter_field=filter_field, 2237 filter_value=filter_value 2238 ) 2239 2240 return G 2241 2242 except Exception as e: 2243 print('\n', e) 2244 2245 def get_transitive_network_graph(self, package_name: str, deep_level: int = 5, generate = False) -> nx.DiGraph: 2246 """ 2247 Gets the transitive dependency network of a package as a NetworkX graph. 
class PackageManagerLoadError(Exception):
    """
    Exception raised when an error occurs while loading a package manager

    Attributes
    ----------
    message : str
        Error message
    """

    def __init__(self, message):
        self.message = message
        super().__init__(self.message)


class PackageManagerSaveError(Exception):
    """
    Exception raised when an error occurs while saving a package manager

    Attributes
    ----------
    message : str
        Error message
    """

    def __init__(self, message):
        self.message = message
        super().__init__(self.message)
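The persistence round trip (`save` pickling the manager, `load_from_persistence` unpickling it and returning `None` on an unreadable or foreign file) can be sketched standalone. `MiniManager`, `SaveError`, and the temporary path below are illustrative stand-ins, not part of olivia_finder:

```python
import os
import pickle
import tempfile


class SaveError(Exception):
    """Stand-in for PackageManagerSaveError in this sketch."""


class MiniManager:
    """Minimal stand-in for PackageManager's persistence behaviour."""

    def __init__(self, packages):
        self.packages = packages

    def save(self, path):
        # Wrap any I/O or pickling failure in the domain exception
        try:
            with open(path, "wb") as f:
                pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            raise SaveError(f"Error saving: {e}") from e

    @classmethod
    def load_from_persistence(cls, path):
        # Unreadable file or wrong object type -> None, mirroring the loader above
        try:
            with open(path, "rb") as f:
                obj = pickle.load(f)
        except Exception:
            return None
        return obj if isinstance(obj, cls) else None


# Round trip through a temporary ".olvpm"-style file
path = os.path.join(tempfile.mkdtemp(), "mini.olvpm")
MiniManager({"pkg_a": ["pkg_b"]}).save(path)
restored = MiniManager.load_from_persistence(path)
print(restored.packages)  # {'pkg_a': ['pkg_b']}
```

A missing or corrupt file simply yields `None` rather than an exception, which is why callers of `load_from_persistence` should check the result before use.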
class PackageManager():
    '''
    Class that represents a package manager, which provides a way to obtain packages from a data source
    and store them in a dictionary
    '''

    def __init__(self, data_sources: Optional[List[DataSource]] = None):
        '''
        Constructor of the PackageManager class

        Parameters
        ----------
        data_sources : Optional[List[DataSource]]
            List of data sources used to obtain the packages

        Raises
        ------
        ValueError
            If the data_sources parameter is None or empty

        Examples
        --------
        >>> package_manager = PackageManager([CSVDataSource("path/to/file.csv", dependent_field="name", dependency_field="dependency")])
        '''

        if not data_sources:
            raise ValueError("Data sources cannot be empty")

        self.data_sources: List[DataSource] = data_sources
        self.packages: Dict[str, Package] = {}

        # Init the logger for the package manager
        self.logger = MyLogger.get_logger('logger_packagemanager')

    def save(self, path: str):
        '''
        Saves the package manager to a file, normally with the extension .olvpm for easy identification
        as an Olivia package manager file

        Parameters
        ----------
        path : str
            Path of the file to save the package manager

        Raises
        ------
        PackageManagerSaveError
            If an error occurs while saving the package manager
        '''

        # Remove redundant objects (request handlers are rebuilt on load)
        for data_source in self.data_sources:
            if isinstance(data_source, ScraperDataSource):
                try:
                    del data_source.request_handler
                except AttributeError:
                    pass

        try:
            # Use pickle to save the package manager
            with open(path, "wb") as f:
                pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            raise PackageManagerSaveError(f"Error saving package manager: {e}") from e

    @classmethod
    def load_from_persistence(cls, path: str):
        '''
        Load the package manager from a file; the file must have been created with the save method
        and normally has the extension .olvpm

        Parameters
        ----------
        path : str
            Path of the file to load the package manager

        Returns
        -------
        Union[PackageManager, None]
            PackageManager object if the file exists and is valid, None otherwise
        '''

        # Init the logger for the package manager
        logger = MyLogger.get_logger("logger_packagemanager")

        # Try to load the package manager from the file
        try:
            # Use pickle to load the package manager
            logger.info(f"Loading package manager from {path}")
            with open(path, "rb") as f:
                obj = pickle.load(f)
            logger.info("Package manager loaded")
        except Exception:
            logger.error(f"Error loading package manager from {path}")
            return None

        if not isinstance(obj, PackageManager):
            return None

        # Restore the request handler for the scraper data sources (removed before saving)
        for data_source in obj.data_sources:
            if isinstance(data_source, ScraperDataSource):
                data_source.request_handler = RequestHandler()
                # Set the logger for the scraper data source
                data_source.logger = MyLogger.get_logger("logger_datasource")

        obj.logger = logger

        return obj

    @classmethod
    def load_from_csv(
        cls,
        csv_path: str,
        dependent_field: Optional[str] = None,
        dependency_field: Optional[str] = None,
        version_field: Optional[str] = None,
        dependency_version_field: Optional[str] = None,
        url_field: Optional[str] = None,
        default_format: Optional[bool] = False,
    ) -> PackageManager:
        '''
        Load a csv file into a PackageManager object

        Parameters
        ----------
        csv_path : str
            Path of the csv file to load
        dependent_field : Optional[str], optional
            Name of the dependent field, by default None
        dependency_field : Optional[str], optional
            Name of the dependency field, by default None
        version_field : Optional[str], optional
            Name of the version field, by default None
        dependency_version_field : Optional[str], optional
            Name of the dependency version field, by default None
        url_field : Optional[str], optional
            Name of the url field, by default None
        default_format : bool, optional
            If True, the csv has the structure of full_adjlist.csv, by default False

        Raises
        ------
        PackageManagerLoadError
            If the csv file cannot be loaded or does not contain the specified fields

        Examples
        --------
        >>> pm = PackageManager.load_from_csv(
        ...     "full_adjlist.csv",
        ...     dependent_field="dependent",
        ...     dependency_field="dependency",
        ...     version_field="version",
        ...     dependency_version_field="dependency_version",
        ...     url_field="url"
        ... )
        >>> pm = PackageManager.load_from_csv("full_adjlist.csv", default_format=True)
        '''

        # Init the logger for the package manager
        logger = MyLogger.get_logger('logger_packagemanager')

        try:
            logger.info(f"Loading csv file from {csv_path}")
            data = pd.read_csv(csv_path)
        except Exception as e:
            logger.error(f"Error loading csv file: {e}")
            raise PackageManagerLoadError(f"Error loading csv file: {e}") from e

        csv_fields = []

        if default_format:
            # If the csv has the structure of full_adjlist.csv, we use the default fields
            dependent_field = 'name'
            dependency_field = 'dependency'
            version_field = 'version'
            dependency_version_field = 'dependency_version'
            url_field = 'url'
            csv_fields = [dependent_field, dependency_field,
                          version_field, dependency_version_field, url_field]
        else:
            if dependent_field is None or dependency_field is None:
                raise PackageManagerLoadError(
                    "Dependent and dependency fields must be specified")

            csv_fields = [dependent_field, dependency_field]
            # If the optional fields are specified, we add them to the list
            if version_field is not None:
                csv_fields.append(version_field)
            if dependency_version_field is not None:
                csv_fields.append(dependency_version_field)
            if url_field is not None:
                csv_fields.append(url_field)

        # If the csv does not have the specified fields, we raise an error
        if any(col not in data.columns for col in csv_fields):
            logger.error("Invalid csv format")
            raise PackageManagerLoadError("Invalid csv format")

        # Create the data source
        data_source = CSVDataSource(
            file_path=csv_path,
            dependent_field=dependent_field,
            dependency_field=dependency_field,
            dependent_version_field=version_field,
            dependency_version_field=dependency_version_field,
            dependent_url_field=url_field
        )

        obj = cls([data_source])

        # Add the logger to the package manager
        obj.logger = logger

        return obj

    def initialize(
            self,
            package_names: Optional[List[str]] = None,
            show_progress: Optional[bool] = False,
            chunk_size: Optional[int] = 10000):
        '''
        Initializes the package manager by loading the packages from the data source

        Parameters
        ----------
        package_names : Optional[List[str]]
            List of package names to load; if None, all the packages will be loaded
        show_progress : bool
            If True, a progress bar will be shown
        chunk_size : int
            Size of the chunks in which the packages are loaded, to avoid memory errors

        .. warning::

            For large package lists, this method can take a long time to complete
        '''

        # Get package names from the data sources if needed
        if package_names is None:
            for data_source in self.data_sources:
                try:
                    package_names = data_source.obtain_package_names()
                    break
                except NotImplementedError as e:
                    self.logger.debug(f"Data source {data_source} does not implement the obtain_package_names method: {e}")
                    continue
                except Exception as e:
                    self.logger.error(f"Error while obtaining package names from data source: {e}")
                    continue

        # Check if the package names are valid
        if package_names is None or not isinstance(package_names, list):
            raise ValueError("No valid package names found")

        # Instantiate the progress bar if needed
        progress_bar = tqdm.tqdm(
            total=len(package_names),
            colour="green",
            desc="Loading packages",
            unit="packages",
        ) if show_progress else None

        # Create a chunked list of package names to avoid memory errors
        package_names_chunked = [package_names[i:i + chunk_size] for i in range(0, len(package_names), chunk_size)]

        for chunk in package_names_chunked:
            # Obtain the packages data from the data source and store them
            self.fetch_packages(
                package_names=chunk,
                progress_bar=progress_bar,
                extend=True
            )

        # Close the progress bar if needed
        if progress_bar is not None:
            progress_bar.close()

    def fetch_package(self, package_name: str) -> Union[Package, None]:
        '''
        Builds a Package object using the data sources in order until one of them returns a valid package

        Parameters
        ----------
        package_name : str
            Name of the package

        Returns
        -------
        Union[Package, None]
            Package object if the package exists, None otherwise

        Examples
        --------
        >>> package = package_manager.fetch_package("package_name")
        >>> package
        <Package: package_name>
        '''
        # Obtain the package data from the data sources in order
        package_data = None
        for data_source in self.data_sources:

            if isinstance(data_source, (GithubScraper, CSVDataSource, ScraperDataSource, LibrariesioDataSource)):
                package_data = data_source.obtain_package_data(package_name)
            else:
                # Fall back to the in-memory data of the package manager
                package = self.get_package(package_name)
                package_data = package.to_dict() if package is not None else None

            if package_data is not None:
                self.logger.debug(f"Package {package_name} found using {data_source.__class__.__name__}")
                break
            else:
                self.logger.debug(f"Package {package_name} not found using {data_source.__class__.__name__}")

        # Return the package if it exists
        return None if package_data is None else Package.load(package_data)

    def fetch_packages(
            self,
            package_names: List[str],
            progress_bar: Optional[tqdm.tqdm],
            extend: bool = False
    ) -> List[Package]:
        '''
        Builds a list of Package objects using the data sources in order until one of them returns a valid package

        Parameters
        ----------
        package_names : List[str]
            List of package names
        progress_bar : Optional[tqdm.tqdm]
            Progress bar to show the progress of the operation
        extend : bool
            If True, the packages will be added to the existing ones; otherwise, the existing ones will be replaced

        Returns
        -------
        List[Package]
            List of Package objects

        Examples
        --------
        >>> packages = package_manager.fetch_packages(["package_name_1", "package_name_2"])
        >>> packages
        [<Package: package_name_1>, <Package: package_name_2>]
        '''

        # Check if the package names are valid
        if not isinstance(package_names, list):
            raise ValueError("Package names must be a list")

        preferred_data_source = self.data_sources[0]

        # Return list
        packages = []

        # If the datasource is an instance of ScraperDataSource, use the obtain_packages_data method for parallelization
        if isinstance(preferred_data_source, ScraperDataSource):

            data_found, not_found = preferred_data_source.obtain_packages_data(
                package_names=package_names,
                progress_bar=progress_bar  # type: ignore
            )
            self.logger.info(f"Packages found: {len(data_found)}, Packages not found: {len(not_found)}")
            packages = [Package.load(package_data) for package_data in data_found]

        # If not, use the fetch_package method for sequential processing over the data sources of the list
        else:

            while len(package_names) > 0:

                package_name = package_names[0]
                package = self.fetch_package(package_name)
                if package is not None:
                    packages.append(package)

                # Remove the package from the pending packages
                del package_names[0]

                if progress_bar is not None:
                    progress_bar.update(1)

        self.logger.info(f"Total packages found: {len(packages)}")

        # Update the self.packages attribute, overwriting the packages with the same name
        # but conserving the other packages
        if extend:
            self.logger.info("Extending data source with obtained packages")
            for package in packages:
                self.packages[package.name] = package

        return packages

    def get_package(self, package_name: str) -> Union[Package, None]:
        '''
        Obtain a package from the package manager

        Parameters
        ----------
        package_name : str
            Name of the package

        Returns
        -------
        Union[Package, None]
            Package object if the package exists, None otherwise

        Examples
        --------
        >>> package = package_manager.get_package("package_name")
        >>> print(package.name)
        '''
        return self.packages.get(package_name, None)

    def get_packages(self) -> List[Package]:
        '''
        Obtain the list of packages of the package manager

        Returns
        -------
        List[Package]
            List of packages of the package manager

        Examples
        --------
        >>> package_list = package_manager.get_packages()
        '''
        return list(self.packages.values())

    def package_names(self) -> List[str]:
        '''
        Obtain the list of package names stored in the package manager

        Returns
        -------
        List[str]
            List of package names of the package manager

        Examples
        --------
        >>> names = package_manager.package_names()
        '''
        return list(self.packages.keys())

    def fetch_package_names(self) -> List[str]:
        '''
        Obtain the list of package names from the first data source

        Returns
        -------
        List[str]
            List of package names of the data source

        Examples
        --------
        >>> names = package_manager.fetch_package_names()
        '''
        return self.data_sources[0].obtain_package_names()

    def export_dataframe(self, full_data=False) -> pd.DataFrame:
        '''
        Convert the object to an adjacency list, where each row represents a dependency.
        If a package has no dependencies, it will appear in the list with the dependency field empty

        Parameters
        ----------
        full_data : bool, optional
            If True, the adjacency list will contain the version and url of the packages, by default False

        Returns
        -------
        pd.DataFrame
            Dependency network as an adjacency list

        Examples
        --------
        >>> adj_list = package_manager.export_dataframe()
        >>> print(adj_list)
        [name, dependency]
        '''

        if not self.packages:
            self.logger.debug("The package manager is empty")
            return pd.DataFrame()

        rows = []

        if full_data:
            for package_name in self.packages.keys():
                package = self.get_package(package_name)

                for dependency in package.dependencies:

                    try:
                        dependency_full = self.get_package(dependency.name)
                        rows.append(
                            [package.name, package.version, package.url,
                             dependency_full.name, dependency_full.version, dependency_full.url]
                        )
                    except Exception:
                        # The dependency is not in the package manager; export it without version and url
                        if dependency.name is not None:
                            rows.append(
                                [package.name, package.version, package.url, dependency.name, None, None]
                            )

            return pd.DataFrame(rows, columns=['name', 'version', 'url', 'dependency', 'dependency_version', 'dependency_url'])
        else:
            for package_name in self.packages.keys():
                package = self.get_package(package_name)
                rows.extend(
                    [package.name, dependency.name]
                    for dependency in package.dependencies
                )
            return pd.DataFrame(rows, columns=['name', 'dependency'])

    def get_adjlist(self, package_name: str, adjlist: Optional[Dict] = None, deep_level: int = 5) -> Dict[str, List[str]]:
        """
        Generates the dependency network of a package from the in-memory data of the package manager.

        Parameters
        ----------
        package_name : str
            The name of the package to generate the dependency network
        adjlist : Optional[Dict], optional
            The dependency network built so far, by default None
        deep_level : int, optional
            The depth level of the dependency network, by default 5

        Returns
        -------
        Dict[str, List[str]]
            The dependency network of the package
        """

        # If the dependency network is not specified, we create it (Initial case)
        if adjlist is None:
            adjlist = {}

        # If the deep level is 0, we return the dependency network (Stop condition)
        if deep_level == 0:
            return adjlist

        # If the package is already in the dependency network, we return it (Stop condition)
        if package_name in adjlist:
            return adjlist

        # Use the data of the package manager
        current_package = self.get_package(package_name)
        dependencies = current_package.get_dependencies_names() if current_package is not None else []

        # Add the package and its dependencies to the dependency network
        adjlist[package_name] = dependencies

        # Append the dependencies of the package to the dependency network
        for dependency_name in dependencies:

            if (dependency_name not in adjlist) and (self.get_package(dependency_name) is not None):

                adjlist = self.get_adjlist(
                    package_name=dependency_name,
                    adjlist=adjlist,
                    deep_level=deep_level - 1,
                )

        return adjlist

    def fetch_adjlist(self, package_name: str, deep_level: int = 5, adjlist: Optional[Dict] = None) -> Dict[str, List[str]]:
        """
        Generates the dependency network of a package from the data source.

        Parameters
        ----------
        package_name : str
            The name of the package to generate the dependency network
        deep_level : int, optional
            The depth level of the dependency network, by default 5
        adjlist : Optional[Dict], optional
            The dependency network built so far, by default None

        Returns
        -------
        Dict[str, List[str]]
            The dependency network of the package
        """

        if adjlist is None:
            adjlist = {}

        # If the deep level is 0 or the package was already visited, we return the adjacency list (Stop condition)
        if deep_level == 0 or package_name in adjlist:
            return adjlist

        dependencies = []
        try:
            current_package = self.fetch_package(package_name)
            dependencies = current_package.get_dependencies_names()
        except Exception as e:
            self.logger.debug(f"Package {package_name} not found: {e}")

        # Add the package to the adjacency list if it is not already in it
        adjlist[package_name] = dependencies

        # Append the dependencies of the package to the adjacency list if they are not already in it
        for dependency_name in dependencies:
            if dependency_name not in adjlist:
                try:
                    adjlist = self.fetch_adjlist(
                        package_name=dependency_name,  # The name of the dependency
                        deep_level=deep_level - 1,     # The deep level is reduced by 1
                        adjlist=adjlist                # The global adjacency list
                    )
                except Exception:
                    self.logger.debug(
                        f"The package {dependency_name}, as dependency of {package_name}, does not exist in the data source"
                    )

        return adjlist

    def __add_chunk(self,
                    df, G,
                    filter_field=None,
                    filter_value=None
                    ):

        filtered = df[df[filter_field] == filter_value] if filter_field else df
        links = list(zip(filtered["name"], filtered["dependency"]))
        G.add_edges_from(links)
        return G

    def get_network_graph(
            self, chunk_size=int(1e6),
            source_field="dependency", target_field="name",
            filter_field=None, filter_value=None) -> nx.DiGraph:
        """
        Builds a dependency network graph from a dataframe of dependencies.
        The dataframe must have two columns: dependent and dependency.

        Parameters
        ----------
        chunk_size : int
            Number of rows to process at a time
        source_field : str
            Name of the column containing the source node
        target_field : str
            Name of the column containing the target node
        filter_field : str, optional
            Name of the column to filter on, by default None
        filter_value : str, optional
            Value to filter on, by default None

        Returns
        -------
        nx.DiGraph
            Directed graph of dependencies
        """

        # If the default datasource is a CSVDataSource, we use a custom implementation
        default_datasource = self.__get_default_datasource()
        if isinstance(default_datasource, CSVDataSource):
            return nx.from_pandas_edgelist(
                default_datasource.data, source=source_field,
                target=target_field, create_using=nx.DiGraph()
            )

        # If the default datasource is not a CSVDataSource, we use the default implementation
        df = self.export_dataframe()
        try:
            # New NetworkX directed graph
            G = nx.DiGraph()

            for i in range(0, len(df), chunk_size):
                chunk = df.iloc[i:i + chunk_size]
                # Add dependencies from chunk to G
                G = self.__add_chunk(
                    chunk,
                    G,
                    filter_field=filter_field,
                    filter_value=filter_value
                )

            return G

        except Exception as e:
            print('\n', e)

    def get_transitive_network_graph(self, package_name: str, deep_level: int = 5, generate=False) -> nx.DiGraph:
        """
        Gets the transitive dependency network of a package as a NetworkX graph.

        Parameters
        ----------
        package_name : str
            The name of the package to get the dependency network
        deep_level : int, optional
            The depth level of the dependency network, by default 5
        generate : bool, optional
            If True, the dependency network is generated from the data source, by default False

        Returns
        -------
        nx.DiGraph
            The dependency network of the package
        """

        if generate:
            # Get the dependency network from the data source
            dependency_network = self.fetch_adjlist(package_name=package_name, deep_level=deep_level, adjlist={})
        else:
            # Get the dependency network from in-memory data
            dependency_network = self.get_adjlist(package_name=package_name, deep_level=deep_level)

        # Create a NetworkX graph of the dependency network as (DEPENDENCY ---> PACKAGE)
        G = nx.DiGraph()
        for name, dependencies in dependency_network.items():
            for dependency_name in dependencies:
                G.add_edge(dependency_name, name)

        return G

    def __get_default_datasource(self):
        """
        Gets the default data source

        Returns
        -------
        DataSource
            The default data source
        """

        return self.data_sources[0] if len(self.data_sources) > 0 else None
Class that represents a package manager, which provides a way to obtain packages from a data source and store them in a dictionary
    def __init__(self, data_sources: Optional[List[DataSource]] = None):
        '''
        Constructor of the PackageManager class

        Parameters
        ----------
        data_sources : Optional[List[DataSource]]
            List of data sources to obtain the packages from; must not be None or empty

        Raises
        ------
        ValueError
            If the data_sources parameter is None or empty

        Examples
        --------
        >>> package_manager = PackageManager([CSVDataSource("path/to/file.csv")])
        '''

        if not data_sources:
            raise ValueError("Data source cannot be empty")

        self.data_sources: List[DataSource] = data_sources
        self.packages: Dict[str, Package] = {}

        # Init the logger for the package manager
        self.logger = MyLogger.get_logger('logger_packagemanager')
Constructor of the PackageManager class
Parameters
- data_sources (Optional[List[DataSource]]): List of data sources to obtain the packages from; must not be None or empty
Raises
- ValueError: If the data_sources parameter is None or empty
Examples
>>> package_manager = PackageManager([CSVDataSource("path/to/file.csv")])
    def save(self, path: str):
        '''
        Saves the package manager to a file; normally it has the extension .olvpm for easy identification
        as an Olivia package manager file

        Parameters
        ----------
        path : str
            Path of the file to save the package manager
        '''

        # Remove redundant objects
        for data_source in self.data_sources:
            if isinstance(data_source, ScraperDataSource):
                try:
                    del data_source.request_handler
                except AttributeError:
                    pass

        try:
            # Use pickle to save the package manager
            with open(path, "wb") as f:
                pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            raise PackageManagerSaveError(f"Error saving package manager: {e}") from e
Saves the package manager to a file; normally it has the extension .olvpm for easy identification as an Olivia package manager file
Parameters
- path (str): Path of the file to save the package manager
    @classmethod
    def load_from_persistence(cls, path: str):
        '''
        Load the package manager from a file; the file must have been created with the save method
        and normally has the extension .olvpm

        Parameters
        ----------
        path : str
            Path of the file to load the package manager from

        Returns
        -------
        Union[PackageManager, None]
            PackageManager object if the file exists and is valid, None otherwise
        '''

        # Init the logger for the package manager
        logger = MyLogger.get_logger("logger_packagemanager")

        # Try to load the package manager from the file
        try:
            # Use pickle to load the package manager
            logger.info(f"Loading package manager from {path}")
            with open(path, "rb") as f:
                obj = pickle.load(f)
            logger.info("Package manager loaded")
        except (OSError, pickle.UnpicklingError) as e:
            logger.error(f"Error loading package manager from {path}: {e}")
            return None

        if not isinstance(obj, PackageManager):
            return None

        # Set the request handler for the scraper data sources
        for data_source in obj.data_sources:
            if isinstance(data_source, ScraperDataSource):
                data_source.request_handler = RequestHandler()
                # Set the logger for the scraper data source
                data_source.logger = MyLogger.get_logger("logger_datasource")

        obj.logger = logger

        return obj
Load the package manager from a file; the file must have been created with the save method and normally has the extension .olvpm
Parameters
- path (str): Path of the file to load the package manager
Returns
- Union[PackageManager, None]: PackageManager object if the file exists and is valid, None otherwise
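Persistence is a plain pickle round-trip: save() dumps the object and load_from_persistence() reads it back and checks its type. A minimal self-contained sketch of the same pattern, using a hypothetical ToyManager class in place of PackageManager:

```python
import os
import pickle
import tempfile

class ToyManager:
    """Hypothetical stand-in for PackageManager, holding only a package dict."""
    def __init__(self, packages):
        self.packages = packages

# Save the object with the highest pickle protocol, as save() does
manager = ToyManager({"pkgA": ["pkgB"], "pkgB": []})
path = os.path.join(tempfile.mkdtemp(), "toy.olvpm")
with open(path, "wb") as f:
    pickle.dump(manager, f, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back and validate the type, as load_from_persistence() does
with open(path, "rb") as f:
    obj = pickle.load(f)
loaded = obj if isinstance(obj, ToyManager) else None

print(loaded.packages)  # {'pkgA': ['pkgB'], 'pkgB': []}
```

The type check mirrors why load_from_persistence returns None for files that unpickle into something other than a PackageManager.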
    @classmethod
    def load_from_csv(
        cls,
        csv_path: str,
        dependent_field: Optional[str] = None,
        dependency_field: Optional[str] = None,
        version_field: Optional[str] = None,
        dependency_version_field: Optional[str] = None,
        url_field: Optional[str] = None,
        default_format: Optional[bool] = False,
    ) -> PackageManager:
        '''
        Load a csv file into a PackageManager object

        Parameters
        ----------
        csv_path : str
            Path of the csv file to load
        dependent_field : str, optional
            Name of the dependent field, by default None
        dependency_field : str, optional
            Name of the dependency field, by default None
        version_field : str, optional
            Name of the version field, by default None
        dependency_version_field : str, optional
            Name of the dependency version field, by default None
        url_field : str, optional
            Name of the url field, by default None
        default_format : bool, optional
            If True, the csv has the structure of full_adjlist.csv, by default False

        Examples
        --------
        >>> pm = PackageManager.load_from_csv(
        ...     "full_adjlist.csv",
        ...     dependent_field="dependent",
        ...     dependency_field="dependency",
        ...     version_field="version",
        ...     dependency_version_field="dependency_version",
        ...     url_field="url"
        ... )
        >>> pm = PackageManager.load_from_csv("full_adjlist.csv", default_format=True)
        '''

        # Init the logger for the package manager
        logger = MyLogger.get_logger('logger_packagemanager')

        try:
            logger.info(f"Loading csv file from {csv_path}")
            data = pd.read_csv(csv_path)
        except Exception as e:
            logger.error(f"Error loading csv file: {e}")
            raise PackageManagerLoadError(f"Error loading csv file: {e}") from e

        csv_fields = []

        if default_format:
            # If the csv has the structure of full_adjlist.csv, we use the default fields
            dependent_field = 'name'
            dependency_field = 'dependency'
            version_field = 'version'
            dependency_version_field = 'dependency_version'
            url_field = 'url'
            csv_fields = [dependent_field, dependency_field,
                          version_field, dependency_version_field, url_field]
        else:
            if dependent_field is None or dependency_field is None:
                raise PackageManagerLoadError(
                    "Dependent and dependency fields must be specified")

            csv_fields = [dependent_field, dependency_field]
            # If the optional fields are specified, we add them to the list
            if version_field is not None:
                csv_fields.append(version_field)
            if dependency_version_field is not None:
                csv_fields.append(dependency_version_field)
            if url_field is not None:
                csv_fields.append(url_field)

        # If the csv does not have the specified fields, we raise an error
        if any(col not in data.columns for col in csv_fields):
            logger.error("Invalid csv format")
            raise PackageManagerLoadError("Invalid csv format")

        # We create the data source
        data_source = CSVDataSource(
            file_path=csv_path,
            dependent_field=dependent_field,
            dependency_field=dependency_field,
            dependent_version_field=version_field,
            dependency_version_field=dependency_version_field,
            dependent_url_field=url_field
        )

        obj = cls([data_source])

        # Add the logger to the package manager
        obj.logger = logger

        # Return the package manager
        return obj
Load a csv file into a PackageManager object
Parameters
- csv_path (str): Path of the csv file to load
- dependent_field (str = None, optional): Name of the dependent field, by default None
- dependency_field (str = None, optional): Name of the dependency field, by default None
- version_field (str = None, optional): Name of the version field, by default None
- dependency_version_field (str = None, optional): Name of the dependency version field, by default None
- url_field (str = None, optional): Name of the url field, by default None
- default_format (bool, optional): If True, the csv has the structure of full_adjlist.csv, by default False
Examples
>>> pm = PackageManager.load_from_csv(
"full_adjlist.csv",
dependent_field="dependent",
dependency_field="dependency",
version_field="version",
dependency_version_field="dependency_version",
url_field="url"
)
>>> pm = PackageManager.load_from_csv("full_adjlist.csv", default_format=True)
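The column validation in load_from_csv (required fields must exist in the csv header) can be illustrated with the standard library alone. A small sketch with hypothetical field names and an in-memory csv:

```python
import csv
import io

# Hypothetical in-memory csv standing in for an adjacency-list file
csv_text = "name,dependency,version\npkgA,pkgB,1.0\npkgB,,0.2\n"

def check_csv_fields(text, dependent_field, dependency_field):
    """Mirror of load_from_csv's validation: the required columns must exist."""
    header = next(csv.reader(io.StringIO(text)))
    missing = [col for col in (dependent_field, dependency_field) if col not in header]
    if missing:
        raise ValueError(f"Invalid csv format: missing columns {missing}")
    return header

print(check_csv_fields(csv_text, "name", "dependency"))  # ['name', 'dependency', 'version']
```

Passing a field name that is not in the header raises, just as load_from_csv raises PackageManagerLoadError on an invalid csv format.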
    def initialize(
            self,
            package_names: Optional[List[str]] = None,
            show_progress: Optional[bool] = False,
            chunk_size: Optional[int] = 10000):
        '''
        Initializes the package manager by loading the packages from the data source

        Parameters
        ----------
        package_names : Optional[List[str]]
            List of package names to load; if None, all the packages will be loaded
        show_progress : bool
            If True, a progress bar will be shown
        chunk_size : int
            Size of the chunks to load the packages in; this is done to avoid memory errors

        .. warning::

            For large package lists, this method can take a long time to complete
        '''

        # Get package names from the data sources if needed
        if package_names is None:
            for data_source in self.data_sources:
                try:
                    package_names = data_source.obtain_package_names()
                    break
                except NotImplementedError as e:
                    self.logger.debug(f"Data source {data_source} does not implement obtain_package_names method: {e}")
                    continue
                except Exception as e:
                    self.logger.error(f"Error while obtaining package names from data source: {e}")
                    continue

        # Check if the package names are valid
        if package_names is None or not isinstance(package_names, list):
            raise ValueError("No valid package names found")

        # Instantiate the progress bar if needed
        progress_bar = tqdm.tqdm(
            total=len(package_names),
            colour="green",
            desc="Loading packages",
            unit="packages",
        ) if show_progress else None

        # Create a chunked list of package names
        # This is done to avoid memory errors
        package_names_chunked = [package_names[i:i + chunk_size] for i in range(0, len(package_names), chunk_size)]

        for chunk in package_names_chunked:
            # Obtain the packages data from the data source and store them
            self.fetch_packages(
                package_names=chunk,
                progress_bar=progress_bar,
                extend=True
            )

        # Close the progress bar if needed
        if progress_bar is not None:
            progress_bar.close()
Initializes the package manager by loading the packages from the data source
Parameters
- package_names (Optional[List[str]]): List of package names to load; if None, all the packages will be loaded
- show_progress (bool): If True, a progress bar will be shown
- chunk_size (int): Size of the chunks to load the packages in; this is done to avoid memory errors
Warning: for large package lists, this method can take a long time to complete
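The chunking that initialize performs before handing each chunk to fetch_packages is plain list slicing. A standalone sketch (chunked is a hypothetical helper, not part of the API):

```python
def chunked(package_names, chunk_size):
    """Split a package-name list into fixed-size chunks, as initialize()
    does before handing each chunk to fetch_packages()."""
    return [package_names[i:i + chunk_size]
            for i in range(0, len(package_names), chunk_size)]

names = [f"pkg{i}" for i in range(7)]
print(chunked(names, 3))  # [['pkg0', 'pkg1', 'pkg2'], ['pkg3', 'pkg4', 'pkg5'], ['pkg6']]
```

The last chunk is simply shorter; an empty input yields no chunks at all, so the fetch loop is skipped.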
    def fetch_package(self, package_name: str) -> Union[Package, None]:
        '''
        Builds a Package object using the data sources in order until one of them returns a valid package

        Parameters
        ----------
        package_name : str
            Name of the package

        Returns
        -------
        Union[Package, None]
            Package object if the package exists, None otherwise

        Examples
        --------
        >>> package = package_manager.fetch_package("package_name")
        >>> package
        <Package: package_name>
        '''
        # Obtain the package data from the data sources in order
        package_data = None
        for data_source in self.data_sources:

            if isinstance(data_source, (GithubScraper, CSVDataSource, ScraperDataSource, LibrariesioDataSource)):
                package_data = data_source.obtain_package_data(package_name)
            else:
                # Fall back to the data already held by the package manager
                cached_package = self.get_package(package_name)
                package_data = cached_package.to_dict() if cached_package is not None else None

            if package_data is not None:
                self.logger.debug(f"Package {package_name} found using {data_source.__class__.__name__}")
                break
            else:
                self.logger.debug(f"Package {package_name} not found using {data_source.__class__.__name__}")

        # Return the package if it exists
        return None if package_data is None else Package.load(package_data)
Builds a Package object using the data sources in order until one of them returns a valid package
Parameters
- package_name (str): Name of the package
Returns
- Union[Package, None]: Package object if the package exists, None otherwise
Examples
>>> package = package_manager.fetch_package("package_name")
>>> package
<Package: package_name>
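The fall-through over data sources can be sketched with plain dictionaries standing in for the data source objects (toy data, not the real API):

```python
# Toy data sources standing in for ScraperDataSource / CSVDataSource:
# each maps a package name to a data dict, or has no entry for unknown packages.
source_a = {"pkgA": {"name": "pkgA", "version": "1.0"}}
source_b = {"pkgB": {"name": "pkgB", "version": "2.1"}}

def fetch_from_sources(package_name, sources):
    """Ask each source in order; the first non-None answer wins, as in fetch_package()."""
    for source in sources:
        package_data = source.get(package_name)
        if package_data is not None:
            return package_data
    return None

print(fetch_from_sources("pkgB", [source_a, source_b]))  # {'name': 'pkgB', 'version': '2.1'}
print(fetch_from_sources("pkgX", [source_a, source_b]))  # None
```

The order of the list matters: earlier sources shadow later ones, which is why the first element of data_sources acts as the preferred source.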
    def fetch_packages(
            self,
            package_names: List[str],
            progress_bar: Optional[tqdm.tqdm],
            extend: bool = False
    ) -> List[Package]:
        '''
        Builds a list of Package objects using the data sources in order until one of them returns a valid package

        Parameters
        ----------
        package_names : List[str]
            List of package names
        progress_bar : tqdm.tqdm
            Progress bar to show the progress of the operation
        extend : bool
            If True, the packages will be added to the existing ones; otherwise, the existing ones will be replaced

        Returns
        -------
        List[Package]
            List of Package objects

        Examples
        --------
        >>> packages = package_manager.fetch_packages(["package_name_1", "package_name_2"], progress_bar=None)
        >>> packages
        [<Package: package_name_1>, <Package: package_name_2>]
        '''

        # Check if the package names are valid
        if not isinstance(package_names, list):
            raise ValueError("Package names must be a list")

        preferred_data_source = self.data_sources[0]

        # Return list
        packages = []

        # If the datasource is an instance of ScraperDataSource, use the
        # obtain_packages_data method for parallelization
        if isinstance(preferred_data_source, ScraperDataSource):

            packages_data = []
            data_found, not_found = preferred_data_source.obtain_packages_data(
                package_names=package_names,
                progress_bar=progress_bar  # type: ignore
            )
            packages_data.extend(data_found)
            # pending_packages = not_found
            self.logger.info(f"Packages found: {len(data_found)}, Packages not found: {len(not_found)}")
            packages = [Package.load(package_data) for package_data in packages_data]

        # If not, use the fetch_package method for sequential processing over
        # the data sources of the list
        else:

            while len(package_names) > 0:

                package_name = package_names[0]
                package_data = self.fetch_package(package_name)
                if package_data is not None:
                    packages.append(package_data)

                # Remove the package from the pending packages
                del package_names[0]

                if progress_bar is not None:
                    progress_bar.update(1)

        self.logger.info(f"Total packages found: {len(packages)}")

        # Update the self.packages attribute, overwriting the packages with the
        # same name but conserving the other packages
        if extend:
            self.logger.info("Extending data source with obtained packages")
            for package in packages:
                self.packages[package.name] = package

        return packages
Builds a list of Package objects using the data sources in order until one of them returns a valid package
Parameters
- package_names (List[str]): List of package names
- progress_bar (tqdm.tqdm): Progress bar to show the progress of the operation
- extend (bool): If True, the packages will be added to the existing ones, otherwise, the existing ones will be replaced
Returns
- List[Package]: List of Package objects
Examples
>>> packages = package_manager.fetch_packages(["package_name_1", "package_name_2"], progress_bar=None)
>>> packages
[<Package: package_name_1>, <Package: package_name_2>]
    def get_package(self, package_name: str) -> Union[Package, None]:
        '''
        Obtain a package from the package manager

        Parameters
        ----------
        package_name : str
            Name of the package

        Returns
        -------
        Union[Package, None]
            Package object if the package exists, None otherwise

        Examples
        --------
        >>> package = package_manager.get_package("package_name")
        >>> print(package.name)
        '''
        return self.packages.get(package_name, None)
Obtain a package from the package manager
Parameters
- package_name (str): Name of the package
Returns
- Union[Package, None]: Package object if the package exists, None otherwise
Examples
>>> package = package_manager.get_package("package_name")
>>> print(package.name)
    def get_packages(self) -> List[Package]:
        '''
        Obtain the list of packages of the package manager

        Returns
        -------
        List[Package]
            List of packages of the package manager

        Examples
        --------
        >>> package_list = package_manager.get_packages()
        '''
        return list(self.packages.values())
Obtain the list of packages of the package manager
Returns
- List[Package]: List of packages of the package manager
Examples
>>> package_list = package_manager.get_packages()
    def package_names(self) -> List[str]:
        '''
        Obtain the list of package names of the package manager

        Returns
        -------
        List[str]
            List of package names of the package manager

        Examples
        --------
        >>> package_names = package_manager.package_names()
        '''
        return list(self.packages.keys())
Obtain the list of package names of the package manager
Returns
- List[str]: List of package names of the package manager
Examples
>>> package_names = package_manager.package_names()
    def fetch_package_names(self) -> List[str]:
        '''
        Obtain the list of package names from the default data source

        Returns
        -------
        List[str]
            List of package names of the package manager

        Examples
        --------
        >>> package_names = package_manager.fetch_package_names()
        '''

        return self.data_sources[0].obtain_package_names()
Obtain the list of package names from the default data source
Returns
- List[str]: List of package names of the package manager
Examples
>>> package_names = package_manager.fetch_package_names()
    def export_dataframe(self, full_data=False) -> pd.DataFrame:
        '''
        Convert the object to an adjacency list, where each row represents a dependency.
        If a package has no dependencies, it will appear in the list with the dependency field empty

        Parameters
        ----------
        full_data : bool, optional
            If True, the adjacency list will contain the version and url of the packages, by default False

        Returns
        -------
        pd.DataFrame
            Dependency network as an adjacency list

        Examples
        --------
        >>> adj_list = package_manager.export_dataframe()
        >>> print(adj_list)
        [name, dependency]
        '''

        if not self.packages:
            self.logger.debug("The package manager is empty")
            return pd.DataFrame()

        rows = []

        if full_data:
            for package_name in self.packages.keys():
                package = self.get_package(package_name)

                for dependency in package.dependencies:
                    try:
                        dependency_full = self.get_package(dependency.name)
                        rows.append(
                            [package.name, package.version, package.url,
                             dependency_full.name, dependency_full.version, dependency_full.url]
                        )
                    except Exception:
                        if dependency.name is not None:
                            rows.append(
                                [package.name, package.version, package.url, dependency.name, None, None]
                            )

            return pd.DataFrame(rows, columns=['name', 'version', 'url', 'dependency', 'dependency_version', 'dependency_url'])
        else:
            for package_name in self.packages.keys():
                package = self.get_package(package_name)
                rows.extend(
                    [package.name, dependency.name]
                    for dependency in package.dependencies
                )
            return pd.DataFrame(rows, columns=['name', 'dependency'])
Convert the object to an adjacency list, where each row represents a dependency. If a package has no dependencies, it will appear in the list with the dependency field empty
Parameters
- full_data (bool, optional): If True, the adjacency list will contain the version and url of the packages, by default False
Returns
- pd.DataFrame: Dependency network as an adjacency list
Examples
>>> adj_list = package_manager.export_dataframe()
>>> print(adj_list)
[name, dependency]
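The adjacency-list shape of the returned DataFrame (without full_data) can be sketched with toy data; dependencies_of is a hypothetical stand-in for the manager's package dictionary:

```python
import pandas as pd

# Hypothetical in-memory dependency data
dependencies_of = {
    "pkgA": ["pkgB", "pkgC"],
    "pkgB": ["pkgC"],
}

# One row per (package, dependency) pair, as export_dataframe builds them
rows = [[name, dep] for name, deps in dependencies_of.items() for dep in deps]
df = pd.DataFrame(rows, columns=["name", "dependency"])
print(len(df))  # 3
```

Each package contributes one row per dependency, so a package with two dependencies appears in two rows under the same name.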
    def get_adjlist(self, package_name: str, adjlist: Optional[Dict] = None, deep_level: int = 5) -> Dict[str, List[str]]:
        """
        Generates the dependency network of a package from the data already loaded in the package manager.

        Parameters
        ----------
        package_name : str
            The name of the package to generate the dependency network
        adjlist : Optional[Dict], optional
            The dependency network of the package, by default None
        deep_level : int, optional
            The deep level of the dependency network, by default 5

        Returns
        -------
        Dict[str, List[str]]
            The dependency network of the package
        """

        # If the deep level is 0, we return the dependency network (stop condition)
        if deep_level == 0:
            return adjlist

        # If the dependency network is not specified, we create it (initial case)
        if adjlist is None:
            adjlist = {}

        # If the package is already in the dependency network, we return it (stop condition)
        if package_name in adjlist:
            return adjlist

        # Use the data already held by the package manager
        current_package = self.get_package(package_name)
        dependencies = current_package.get_dependencies_names() if current_package is not None else []

        # Add the package and its dependencies to the dependency network
        adjlist[package_name] = dependencies

        # Recurse into each dependency that is not yet in the dependency network
        for dependency_name in dependencies:
            if (dependency_name not in adjlist) and (self.get_package(dependency_name) is not None):
                adjlist = self.get_adjlist(
                    package_name=dependency_name,
                    adjlist=adjlist,
                    deep_level=deep_level - 1,
                )

        return adjlist
Generates the dependency network of a package from the data already loaded in the package manager.
Parameters
- package_name (str): The name of the package to generate the dependency network
- adjlist (Optional[Dict], optional): The dependency network of the package, by default None
- deep_level (int, optional): The deep level of the dependency network, by default 5
Returns
- Dict[str, List[str]]: The dependency network of the package
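The same depth-limited recursion can be sketched over a toy dependency dictionary standing in for get_package() lookups:

```python
# Toy in-memory dependency data standing in for the package manager's packages
known = {
    "pkgA": ["pkgB", "pkgC"],
    "pkgB": ["pkgC"],
    "pkgC": [],
}

def get_adjlist(package_name, adjlist=None, deep_level=5):
    """Depth-limited recursion mirroring PackageManager.get_adjlist."""
    if deep_level == 0:          # Stop condition: depth exhausted
        return adjlist
    if adjlist is None:          # Initial case: create the network
        adjlist = {}
    if package_name in adjlist:  # Stop condition: already visited
        return adjlist
    dependencies = known.get(package_name, [])
    adjlist[package_name] = dependencies
    for dependency_name in dependencies:
        if dependency_name not in adjlist and dependency_name in known:
            adjlist = get_adjlist(dependency_name, adjlist, deep_level - 1)
    return adjlist

print(get_adjlist("pkgA"))  # {'pkgA': ['pkgB', 'pkgC'], 'pkgB': ['pkgC'], 'pkgC': []}
```

The adjlist doubles as the visited set, so shared dependencies (pkgC here) are expanded only once even when reached along two paths.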
    def fetch_adjlist(self, package_name: str, deep_level: int = 5, adjlist: dict = None) -> Dict[str, List[str]]:
        """
        Generates the dependency network of a package from the data source.

        Parameters
        ----------
        package_name : str
            The name of the package to generate the dependency network
        deep_level : int, optional
            The deep level of the dependency network, by default 5
        adjlist : dict, optional
            The dependency network of the package

        Returns
        -------
        Dict[str, List[str]]
            The dependency network of the package
        """

        if adjlist is None:
            adjlist = {}

        # If the deep level is 0 or the package is already present, we return the adjacency list (stop condition)
        if deep_level == 0 or package_name in adjlist:
            return adjlist

        dependencies = []
        try:
            current_package = self.fetch_package(package_name)
            dependencies = current_package.get_dependencies_names()
        except Exception as e:
            self.logger.debug(f"Package {package_name} not found: {e}")

        # Add the package to the adjacency list if it is not already in it
        adjlist[package_name] = dependencies

        # Append the dependencies of the package to the adjacency list if they are not already in it
        for dependency_name in dependencies:
            if dependency_name not in adjlist:
                try:
                    adjlist = self.fetch_adjlist(
                        package_name=dependency_name,   # The name of the dependency
                        deep_level=deep_level - 1,      # The deep level is reduced by 1
                        adjlist=adjlist                 # The shared adjacency list
                    )
                except Exception:
                    self.logger.debug(
                        f"The package {dependency_name}, as a dependency of {package_name}, does not exist in the data source"
                    )

        return adjlist
Generates the dependency network of a package from the data source.
Parameters
- package_name (str): The name of the package to generate the dependency network
- deep_level (int, optional): The deep level of the dependency network, by default 5
- adjlist (dict, optional): The dependency network of the package
Returns
- Dict[str, List[str]]: The dependency network of the package
    def get_network_graph(
            self, chunk_size=int(1e6),
            source_field="dependency", target_field="name",
            filter_field=None, filter_value=None) -> nx.DiGraph:
        """
        Builds a dependency network graph from a dataframe of dependencies.
        The dataframe must have two columns: dependent and dependency.

        Parameters
        ----------
        chunk_size : int
            Number of rows to process at a time
        source_field : str
            Name of the column containing the source node
        target_field : str
            Name of the column containing the target node
        filter_field : str, optional
            Name of the column to filter on, by default None
        filter_value : str, optional
            Value to filter on, by default None

        Returns
        -------
        nx.DiGraph
            Directed graph of dependencies
        """

        # If the default datasource is a CSVDataSource, we use the custom implementation
        default_datasource = self.__get_default_datasource()
        if isinstance(default_datasource, CSVDataSource):
            return nx.from_pandas_edgelist(
                default_datasource.data, source=source_field,
                target=target_field, create_using=nx.DiGraph()
            )

        # If the default datasource is not a CSVDataSource, we use the default implementation
        df = self.export_dataframe()
        try:
            # New NetworkX directed graph
            G = nx.DiGraph()

            for i in range(0, len(df), chunk_size):
                chunk = df.iloc[i:i + chunk_size]
                # Add dependencies from the chunk to G
                G = self.__add_chunk(
                    chunk,
                    G,
                    filter_field=filter_field,
                    filter_value=filter_value
                )

            return G

        except Exception as e:
            print('\n', e)
Builds a dependency network graph from a dataframe of dependencies. The dataframe must have two columns: dependent and dependency.
Parameters
- chunk_size (int): Number of rows to process at a time
- source_field (str): Name of the column containing the source node
- target_field (str): Name of the column containing the target node
- filter_field (str, optional): Name of the column to filter on, by default None
- filter_value (str, optional): Value to filter on, by default None
Returns
- nx.DiGraph: Directed graph of dependencies
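The chunked edge loading can be sketched without NetworkX, using an adjacency dict of sets in place of nx.DiGraph (toy edge data assumed):

```python
# Toy (source, target) pairs standing in for the exported dataframe rows
edges = [("depX", "pkgA"), ("depY", "pkgA"), ("depX", "pkgB")]

def add_chunk(graph, chunk):
    """Add a chunk of (source, target) pairs to the graph, like __add_chunk."""
    for source, target in chunk:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # Ensure the target node exists too
    return graph

graph = {}
chunk_size = 2  # Small chunk size for illustration; the method defaults to 1e6
for i in range(0, len(edges), chunk_size):
    graph = add_chunk(graph, edges[i:i + chunk_size])

print(sorted(graph["depX"]))  # ['pkgA', 'pkgB']
```

Processing the edge list in slices keeps peak memory bounded by the chunk size rather than the whole dataframe, which is the point of the chunk_size parameter.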
    def get_transitive_network_graph(self, package_name: str, deep_level: int = 5, generate=False) -> nx.DiGraph:
        """
        Gets the transitive dependency network of a package as a NetworkX graph.

        Parameters
        ----------
        package_name : str
            The name of the package to get the dependency network
        deep_level : int, optional
            The deep level of the dependency network, by default 5
        generate : bool, optional
            If True, the dependency network is generated from the data source, by default False

        Returns
        -------
        nx.DiGraph
            The dependency network of the package
        """

        if generate:
            # Get the dependency network from the data source
            dependency_network = self.fetch_adjlist(package_name=package_name, deep_level=deep_level, adjlist={})
        else:
            # Get the dependency network from in-memory data
            dependency_network = self.get_adjlist(package_name=package_name, deep_level=deep_level)

        # Create a NetworkX graph of the dependency network as (DEPENDENCY ---> PACKAGE)
        G = nx.DiGraph()
        for package_name, dependencies in dependency_network.items():
            for dependency_name in dependencies:
                G.add_edge(dependency_name, package_name)

        return G
Gets the transitive dependency network of a package as a NetworkX graph.
Parameters
- package_name (str): The name of the package to get the dependency network
- deep_level (int, optional): The deep level of the dependency network, by default 5
- generate (bool, optional): If True, the dependency network is generated from the data source, by default False
Returns
- nx.DiGraph: The dependency network of the package
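Note the edge direction: the adjacency list maps a package to its dependencies, while the graph edges point the other way (DEPENDENCY ---> PACKAGE). A minimal sketch of that reversal with toy data:

```python
# Toy dependency network: package -> list of its dependencies
dependency_network = {"pkgA": ["pkgB"], "pkgB": []}

# Edges are reversed so that they point from dependency to dependent package
graph_edges = []
for package_name, dependencies in dependency_network.items():
    for dependency_name in dependencies:
        graph_edges.append((dependency_name, package_name))

print(graph_edges)  # [('pkgB', 'pkgA')]
```

With this orientation, following edges forward walks from a dependency toward the packages that depend on it, which is the direction reachability analyses over dependency networks typically need.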
class PackageManagerLoadError(Exception):
    """
    Exception raised when an error occurs while loading a package manager

    Attributes
    ----------
    message : str
        Error message
    """

    def __init__(self, message):
        self.message = message
        super().__init__(self.message)
Exception raised when an error occurs while loading a package manager
Attributes
- message (str): Error message
class PackageManagerSaveError(Exception):
    """
    Exception raised when an error occurs while saving a package manager

    Attributes
    ----------
    message : str
        Error message
    """

    def __init__(self, message):
        self.message = message
        super().__init__(self.message)
Exception raised when an error occurs while saving a package manager
Attributes
- message (str): Error message