Summary
Functions for summarising output.
Statistics
dataclass
Statistical summary of results.
Source code in isoslam/summary.py
conversions_threshold
property
writable
Getter method for conversions_threshold.
Returns:
| Type | Description |
|---|---|
| int | The conversion threshold for counting. |
conversions_var
property
writable
Getter method for conversions_var.
Returns:
| Type | Description |
|---|---|
| str | The conversions variable. |
directory
property
writable
Getter method for directory.
Returns:
| Type | Description |
|---|---|
| str | Directory from which output files are loaded. |
file_ext
property
writable
Getter method for file_ext.
Returns:
| Type | Description |
|---|---|
| str | File extension that is loaded. |
groupby
property
writable
Getter method for groupby.
Returns:
| Type | Description |
|---|---|
| list[str] | List of variables to group by. |
regex
property
writable
Getter method for regex.
Returns:
| Type | Description |
|---|---|
| str | Regex for extracting day/hour/replicate from the filename. |
shape
property
Getter for the shape of the dataframe.
Returns:
| Type | Description |
|---|---|
| tuple[int, int] | Shape of the Polars dataframe. |
test_file
property
writable
Getter method for test_file.
Returns:
| Type | Description |
|---|---|
| str | String pattern of test filename for excluding test file data. |
unique
property
Getter for the number of unique files loaded.
Returns:
| Type | Description |
|---|---|
| int | Number of unique rows. |
__post_init__()
After initialisation the files are loaded and prepared for analysis.
Source code in isoslam/summary.py
unique_rows(columns=None)
Identify unique rows in the data for a given set of columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| columns | list[str] | Columns to use for identifying unique observations. If None, all columns are used. | None |
Returns:
| Type | Description |
|---|---|
| int | Number of unique rows for the given set of variables. |
Source code in isoslam/summary.py
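The idea of counting distinct rows over a chosen subset of columns can be sketched in plain Python (a minimal illustration without Polars; the row values are hypothetical):

```python
# Hypothetical rows: (Transcript_id, Strand, Start, End)
rows = [
    ("t1", "+", 10, 20),
    ("t1", "+", 10, 20),  # exact duplicate of the first row
    ("t1", "-", 10, 20),  # differs only by Strand
]

# Counting over all columns keeps the Strand difference
n_unique_all = len(set(rows))

# Restricting to a subset of columns, here (Transcript_id, Start, End),
# collapses rows that only differ on the excluded column
n_unique_subset = len({(t, s, e) for t, _, s, e in rows})
```

With the rows above, `n_unique_all` is 2 while `n_unique_subset` is 1.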
aggregate_conversions(df, groupby='replicate', converted='one_or_more_conversion')
Subset data where there have not been one or more conversions.
NB : This needs a better description, I've failed to capture the essence of what is being done here.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Summary dataframe aggregated to give counts of one or more conversion. | required |
| groupby | str \| list[str] | Variables to group the data by. | 'replicate' |
| converted | str | Variable that contains whether conversions have been observed or not. | 'one_or_more_conversion' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Aggregated dataframe. |
Source code in isoslam/summary.py
append_files(file_ext='.tsv', directory=None)
Append a set of files into a Polars DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_ext | str | File extension to search for results to summarise. | '.tsv' |
| directory | str \| Path \| None | Path on which to search for files with the given file_ext. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | A Polars DataFrame of each file found. |
Source code in isoslam/summary.py
derive_weight_within_isoform(df, groupby='assignment', total='conversion_total')
Calculate weighting used for normalised percentages within each isoform across all time points.
Where the number of total reads (across replicates) is higher, we are more confident in the percentage of
conversions observed, and so we weight the percentages at each time point by the proportion of total counts
calculated previously when deriving the percentage of conversions across replicates (with the
_percent_conversions_across_replicates() function).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Dataframe for which weights are to be derived. | required |
| groupby | list[str] | Grouping for summation of total counts, defaults to 'assignment'. | 'assignment' |
| total | str | Variable that holds the total number of conversions (across all replicates), default is 'conversion_total'. | 'conversion_total' |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with two new columns: the sum of total conversions across replicates and time points, and the weight derived from it. |
Source code in isoslam/summary.py
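The weighting described above can be illustrated without Polars: within each group, each time point's weight is its share of the group's summed totals. A minimal sketch with hypothetical values (the column names here mirror the parameters but the data is made up):

```python
# Hypothetical totals of conversions per (assignment, hour)
records = [
    {"assignment": "Ret", "hour": 0, "conversion_total": 30},
    {"assignment": "Ret", "hour": 4, "conversion_total": 10},
    {"assignment": "Spl", "hour": 0, "conversion_total": 20},
]

# Sum totals within each assignment group
group_sums: dict[str, int] = {}
for record in records:
    group_sums[record["assignment"]] = (
        group_sums.get(record["assignment"], 0) + record["conversion_total"]
    )

# Weight each time point by its proportion of the group total
for record in records:
    record["weight"] = record["conversion_total"] / group_sums[record["assignment"]]
```

Here the "Ret" time points get weights 0.75 and 0.25 (30 and 10 out of 40), and the single "Spl" time point gets weight 1.0.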
extract_day_hour_and_replicate(df, column='filename', regex='^d(\\w+)_(\\w+)hr(\\w+)_')
Extract the day, hour and replicate from the filename stored in a dataframe column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Polars DataFrame. | required |
| column | str | The name of the column that holds the filename, default 'filename'. | 'filename' |
| regex | str | Regular expression pattern to extract the day, hour and replicate, default '^d(\\w+)_(\\w+)hr(\\w+)_'. | '^d(\\w+)_(\\w+)hr(\\w+)_' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Polars DataFrame augmented with the day, hour and replicate extracted from the filename. |
Source code in isoslam/summary.py
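The default pattern can be exercised directly with Python's re module; the example filename below is made up to match the documented convention (d&lt;day&gt;_&lt;hour&gt;hr&lt;replicate&gt;_):

```python
import re

# The documented default pattern: day, hour and replicate as capture groups
pattern = r"^d(\w+)_(\w+)hr(\w+)_"

# Hypothetical filename following the d<day>_<hour>hr<replicate>_ convention
match = re.match(pattern, "d0_4hr2_example.tsv")
day, hour, replicate = match.groups()
```

For "d0_4hr2_example.tsv" this yields day "0", hour "4" and replicate "2".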
filter_no_conversions(df, groupby='replicate', converted='one_or_more_conversion', test=False)
Filter dataframe for instances where no conversions have been observed.
NB : This needs a better description, I've failed to capture the essence of what is being done here.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Summary dataframe aggregated to give counts of one or more conversion. | required |
| groupby | str \| list[str] | Variables to group the data by. | 'replicate' |
| converted | str | Variable that contains whether conversions have been observed or not. | 'one_or_more_conversion' |
| test | bool | Whether the function is being tested or not. This will prevent a call to | False |
Returns:
| Type | Description |
|---|---|
| DataFrame | Aggregated dataframe. |
Source code in isoslam/summary.py
find_read_pairs(df, index_columns=None, assignment='Assignment')
Find instances where there are conversions for both Return and Splice assignments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Polars DataFrame. | required |
| index_columns | list[str] | List of index columns to select from the dataframe. Should include the unique identifiers, typically (Transcript_id, Strand, Start, End). | None |
| assignment | str | Column that defines assignment of events to Return or Splice. | 'Assignment' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Polars DataFrame of the read pairs. |
Source code in isoslam/summary.py
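The pairing logic can be sketched in plain Python: collect the set of assignments seen per identifier and keep identifiers with both. The "Ret"/"Spl" labels below are assumptions standing in for the Return and Splice assignments, and the single-column identifier is a simplification of the (Transcript_id, Strand, Start, End) index:

```python
from collections import defaultdict

# Hypothetical rows keyed on a simplified identifier
rows = [
    {"Transcript_id": "t1", "Assignment": "Ret"},
    {"Transcript_id": "t1", "Assignment": "Spl"},
    {"Transcript_id": "t2", "Assignment": "Ret"},
]

# Record which assignments are seen for each identifier
assignments = defaultdict(set)
for row in rows:
    assignments[row["Transcript_id"]].add(row["Assignment"])

# Keep identifiers observed with BOTH assignments
paired = [key for key, seen in assignments.items() if {"Ret", "Spl"} <= seen]
```

Only "t1" appears with both assignments, so only it survives the pairing.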
get_groupby(groupby)
Get grouping variables.
Possible groupings:

| Value | Grouping |
|---|---|
| 'base' | ["Transcript_id", "Strand", "Start", "End"] |
| 'assignment' | ["Transcript_id", "Strand", "Start", "End", "Assignment"] |
| 'filename' | ["Transcript_id", "Strand", "Start", "End", "Assignment", "filename"] |
| 'time' | ["Transcript_id", "Strand", "Start", "End", "Assignment", "day", "hour"] |
| 'replicate' | ["Transcript_id", "Strand", "Start", "End", "Assignment", "day", "hour", "replicate"] |
| None | Value of groupby. |

This is typically ["Transcript_id", "Strand", "Start", "End", "Assignment"] when groupby is None but
returns groupby otherwise.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| groupby | list[str] \| None | Variables to group by. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | List of variables to group data by. |
Raises:
| Type | Description |
|---|---|
| ValueError | If an invalid value string is passed. |
Source code in isoslam/summary.py
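The lookup described by the table above can be sketched as a plain dictionary mapping; the function name below is hypothetical, and whether the real API accepts strings as well as lists is an assumption inferred from the table:

```python
# The groupings mirror the table above
BASE = ["Transcript_id", "Strand", "Start", "End"]
GROUPINGS = {
    "base": BASE,
    "assignment": BASE + ["Assignment"],
    "filename": BASE + ["Assignment", "filename"],
    "time": BASE + ["Assignment", "day", "hour"],
    "replicate": BASE + ["Assignment", "day", "hour", "replicate"],
}


def get_groupby_sketch(groupby):
    """Map a named grouping to its variable list, raising on unknown names."""
    if groupby is None:
        return GROUPINGS["assignment"]
    if isinstance(groupby, str):
        if groupby not in GROUPINGS:
            raise ValueError(f"Invalid groupby value: {groupby}")
        return GROUPINGS[groupby]
    return groupby  # already an explicit list of variables
```

Passing an unrecognised string raises ValueError, matching the documented Raises entry.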
get_one_or_more_conversion(df, groupby='replicate', converted='one_or_more_conversion')
Extract instances where one or more conversion has occurred.
There are some cases where, for a given subset, the converted variable (which indicates whether one or more
conversion has occurred) is only ever False. For such instances dummy entries are created based on the
groupby variable and appended to the subset of instances where one or more conversions have been observed.
This function takes as input the results of summary_counts(); it will not work with intermediate files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Summary dataframe aggregated to give counts of one or more conversion. | required |
| groupby | str \| list[str] | Variables to group the data by. | 'replicate' |
| converted | str | Variable that contains whether conversions have been observed or not. | 'one_or_more_conversion' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Aggregated dataframe. |
Source code in isoslam/summary.py
merge_average_with_baseline(df_average, df_baseline, join_on='assignment', zero_baseline_remove=True)
Merge a data frame with the baseline measurements.
Typically for this workflow this involves merging the average data frame (across replicates at each of the transcript/start/end/strand/assignment combinations) with the average at the baseline, to allow normalising the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_average | DataFrame | Polars Dataframe of averaged data. | required |
| df_baseline | DataFrame | Polars Dataframe of averaged baseline data. | required |
| join_on | list[str] \| None | Variables to join the data frames on; if None a default grouping is used. | 'assignment' |
| zero_baseline_remove | bool | Remove instances where the baseline percentage conversion is zero. | True |
Returns:
| Type | Description |
|---|---|
| DataFrame | Averaged and baseline data frame merged on join_on. |
Source code in isoslam/summary.py
normalise(df, to_normalise='conversion_percent', baseline='baseline_percent', normalised='normalised_percent')
Normalise variables based on the baseline measurement.
Assumes that you have merged the averaged dataset with the averaged baseline variables so that the parameter of
interest has its related baseline measurement paired with it. Values are normalised by dividing by the baseline
value, such that the baseline will always start at 1 and subsequent values (time points) are relative to this,
showing increases or decreases. Typically these will be relative changes in the (averaged) percentage of conversions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Dataframe of merged averaged and baseline data. | required |
| to_normalise | str | Variable to be normalised, default is 'conversion_percent'. | 'conversion_percent' |
| baseline | str | Variable to use for normalising, default is 'baseline_percent'. | 'baseline_percent' |
| normalised | str | Variable name for normalised value, default is 'normalised_percent'. | 'normalised_percent' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Polars dataframe with normalised values. |
Source code in isoslam/summary.py
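The normalisation itself is a simple element-wise division of the variable of interest by its paired baseline. A minimal sketch with hypothetical merged rows (plain Python, column names taken from the defaults above):

```python
# Hypothetical merged rows: each observation carries its baseline alongside
rows = [
    {"hour": 0, "conversion_percent": 5.0, "baseline_percent": 5.0},
    {"hour": 4, "conversion_percent": 7.5, "baseline_percent": 5.0},
]

for row in rows:
    # Dividing by the baseline puts the baseline itself at exactly 1.0;
    # later time points are relative increases/decreases
    row["normalised_percent"] = row["conversion_percent"] / row["baseline_percent"]
```

The baseline row normalises to 1.0 and the later time point to 1.5, i.e. a 50% relative increase.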
percent_conversions_across_replicates(df, groupby='time', count='conversion_count', total='conversion_total')
Percentage of conversions across replicates for each time point.
The raw counts and total conversions for each replicate are available. These are summed and the percentage of conversions across replicates calculated. This is mathematically the same as taking the weighted mean of the percentage of conversions within each replicate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Polars Dataframe of conversions. | required |
| groupby | str \| list[str] | Variables to group the data by. | 'time' |
| count | str | Variable/column name holding the counts, default is 'conversion_count'. | 'conversion_count' |
| total | str | Variable/column name holding the total number of conversions, default is 'conversion_total'. | 'conversion_total' |
|
Returns:
| Type | Description |
|---|---|
| DataFrame | Weighted mean of the percentage of conversions (weighted by total conversions) across replicates for the given transcript/assignment/strand/day/hour (as specified by groupby). |
Source code in isoslam/summary.py
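The equivalence claimed above (pooled percentage equals the total-weighted mean of per-replicate percentages) can be checked with a small worked example; the replicate values are hypothetical:

```python
# Two hypothetical replicates with raw counts and totals
replicates = [
    {"conversion_count": 3, "conversion_total": 10},
    {"conversion_count": 9, "conversion_total": 30},
]
count_sum = sum(r["conversion_count"] for r in replicates)
total_sum = sum(r["conversion_total"] for r in replicates)

# Pooled percentage: sum of counts over sum of totals
pooled_percent = 100 * count_sum / total_sum

# Weighted mean of per-replicate percentages, weighted by totals
weighted_percent = sum(
    100 * (r["conversion_count"] / r["conversion_total"]) * r["conversion_total"]
    for r in replicates
) / total_sum
```

Both routes give 30% here; algebraically the per-replicate totals cancel, which is why summing counts and totals first is equivalent to the weighted mean.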
remove_zero_baseline(df, groupby='base', percent_col=None)
Remove data where the percentage change at baseline is zero.
Removes all observations for a transcript/strand/start/end/assignment where the percentage change at baseline is
zero. Such instances need removing because the data is normalised by the baseline measurement, and division by
zero leads to NaN/Inf data points which cannot be analysed in any meaningful way.
Typically this should be run on the data after averaging across replicates, since the percentage change is
calculated across all replicates and any observation with zero percentage change could still contribute to the
total number of events. There is, however, nothing preventing the function from being used on data prior to
averaging, but that would be atypical usage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Polars DataFrame with percentage changes at each time point for transcript/strand/start/end/assignment. | required |
| groupby | str \| list[str] | Grouping of variables to look within for baseline of zero percent change. Default is 'base'. | 'base' |
| percent_col | str | Column name that holds the percentage, defaults to 'conversion_percent' if not specified. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | Polars DataFrame with groups where the percent change at baseline is zero removed. |
Source code in isoslam/summary.py
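The key point is that the whole group is dropped, not just the zero row. A plain-Python sketch with hypothetical data and a simplified single-column group key:

```python
# Hypothetical averaged data: t1 has a zero percentage at baseline (hour 0)
observations = [
    {"transcript": "t1", "hour": 0, "conversion_percent": 0.0},
    {"transcript": "t1", "hour": 4, "conversion_percent": 2.0},
    {"transcript": "t2", "hour": 0, "conversion_percent": 5.0},
    {"transcript": "t2", "hour": 4, "conversion_percent": 7.5},
]

# Identify groups whose baseline percentage is zero...
zero_at_baseline = {
    o["transcript"]
    for o in observations
    if o["hour"] == 0 and o["conversion_percent"] == 0.0
}

# ...then drop EVERY row belonging to those groups, not just the baseline row
kept = [o for o in observations if o["transcript"] not in zero_at_baseline]
```

All of t1 is removed (its later time point included), leaving only the t2 rows, which can then be safely normalised by their non-zero baseline.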
select_base_levels(df, base_day=0, base_hour=0)
Select the base level reference across all data.
This allows selecting the base level of totals and percents which are used for normalising values. Will drop the
column replicate from the data frame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Polars Dataframe of conversions. | required |
| base_day | int | Day to be used for reference, default is 0. | 0 |
| base_hour | int | Hour to be used for reference, default is 0. | 0 |
Returns:
| Type | Description |
|---|---|
| DataFrame | Subset of data with values at baseline (default day 0, hour 0). |
Source code in isoslam/summary.py
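The selection amounts to filtering on the baseline day and hour and dropping the replicate column; a plain-Python sketch with hypothetical rows:

```python
# Hypothetical rows with day/hour/replicate; the baseline is day 0, hour 0
rows = [
    {"day": 0, "hour": 0, "replicate": 1, "conversion_percent": 5.0},
    {"day": 0, "hour": 4, "replicate": 1, "conversion_percent": 7.5},
]

base_day, base_hour = 0, 0
baseline = [
    {k: v for k, v in row.items() if k != "replicate"}  # drop the replicate column
    for row in rows
    if row["day"] == base_day and row["hour"] == base_hour
]
```

Only the day 0 / hour 0 row survives, with the replicate column removed.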
summary_counts(file_ext='.tsv', directory=None, groupby=None, conversions_var='Conversions', conversions_threshold=1, test_file='no4sU', filename_var=None, regex=None)
Group the data and count by various factors.
Typically we want to know whether conversions have happened or not; by default this is based on Conversions >= 1,
but it is configurable via the conversions_var and conversions_threshold parameters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_ext | str | File extension to search for results to summarise. | '.tsv' |
| directory | str \| Path \| None | Path on which to search for files with the given file_ext. | None |
| groupby | list[str] | List of variables to group the counts by; if None a default grouping is used. | None |
| conversions_var | str | The column name that holds conversions, default 'Conversions'. | 'Conversions' |
| conversions_threshold | int | Threshold for counting conversions, default 1. | 1 |
| test_file | str \| None | Unique identifier for test file; files with this string in their names are removed. | 'no4sU' |
| filename_var | str \| None | Column that holds filename. | None |
| regex | str | Regular expression pattern to extract the hour and replicate from the filename, default None. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | A Polars DataFrame counting the total conversions, the number by whether conversions happened, and the percentage. |
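The thresholded counting at the heart of this summary can be sketched in plain Python; the read values are hypothetical:

```python
# Hypothetical reads; a read counts as converted when Conversions >= threshold
reads = [
    {"Transcript_id": "t1", "Conversions": 0},
    {"Transcript_id": "t1", "Conversions": 2},
    {"Transcript_id": "t1", "Conversions": 1},
]

conversions_threshold = 1
converted = sum(r["Conversions"] >= conversions_threshold for r in reads)
total = len(reads)
percent_converted = 100 * converted / total
```

Two of the three reads meet the default threshold of 1, giving roughly 66.7% converted; raising conversions_threshold makes the criterion stricter.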